Estimating the Covariance Matrix of the Maximum Likelihood Estimator Under Linear Cluster-Weighted Models

In recent years, the research into cluster-weighted models has been intense. However, estimating the covariance matrix of the maximum likelihood estimator under a cluster-weighted model is still an open issue. Here, an approach is developed in which information-based estimators of such a covariance matrix are obtained from the incomplete data log-likelihood of the multivariate Gaussian linear cluster-weighted model. To this end, analytical expressions for the score vector and Hessian matrix are provided. Three estimators of the asymptotic covariance matrix of the maximum likelihood estimator, based on the score vector and Hessian matrix, are introduced. The performances of these estimators are numerically evaluated using simulated datasets in comparison with a bootstrap-based estimator; their usefulness is illustrated through a study aiming at evaluating the link between tourism flows and attendance at museums and monuments in two Italian regions.


Introduction
Cluster-weighted models constitute an approach to regression analysis with random covariates in which supervised (regression) and unsupervised (model-based cluster analysis) learning methods are jointly exploited (Hastie et al., 2009). In this approach, a given random vector is assumed to be composed of an outcome Y (response, dependent variable) and its explanatory variables X (covariates, predictors). Furthermore, sample observations are allowed to come from a population composed of several unknown sub-populations. Finally, the joint distribution of the outcome and the covariates is modelled through a finite mixture model specified so as to account for a different effect of the covariates on the response within each sub-population. Thus, cluster-weighted models are useful to perform model-based cluster analysis in a multivariate regression setting with random covariates and unobserved heterogeneity.
Since the introduction of this approach (Gershenfeld, 1997), the research into clusterweighted models has been intense, especially in the last 8 years. Models for continuous variables under normal mixture models have been proposed by Ingrassia et al. (2012) and Dang et al. (2017). Robustified solutions have been developed by Ingrassia et al. (2014) and Punzo and McNicholas (2017), based on the use of Student t and contaminated normal mixture distributions, respectively. Punzo and Ingrassia (2013), Punzo and Ingrassia (2016), Ingrassia et al. (2015) and Di Mari et al. (2020) have introduced models for various types of responses. Models able to deal with non-linear relationships or many covariates have been proposed by Punzo (2014), Subedi et al. (2013) and Subedi et al. (2015).
By focusing the attention on Gaussian cluster-weighted models in which the effects of the covariates on the response within each sub-population are linear, model parameters are generally estimated through the maximum likelihood (ML) method by resorting to the expectation-maximisation (EM) algorithm (Dempster et al., 1977), which is widely employed in incomplete-data problems. In these models, the observed data S = {(x i , y i ), i = 1, . . . , I } are incomplete because the specific component density that generates the I sample observations is missing. This missing information is modelled through an unobserved variable coming from a pre-specified multinomial distribution and is added to the observed data so as to form the so-called complete data. Then, the ML estimate is computed from the maximisation of the complete data log-likelihood. A description of the EM algorithm for the linear Gaussian cluster-weighted model can be found in Dang et al. (2017). Specific functions implementing such algorithm for models with a univariate response are available in the package flexCWM (Mazza et al., 2018) for the R software environment (R Core Team, 2020).
A by-product of this estimating approach is a set of K estimated posterior probabilities that each sample observation comes from the K Gaussian distributions of the mixture. Thus, a by-product of fitting a linear Gaussian cluster-weighted model is a clustering of the I sample observations, based on a rule that assigns an observation to the distribution of the mixture to which it has the highest posterior probability of belonging. However, an estimating approach based on the use of an EM algorithm has the drawback of not automatically producing any estimate of the covariance matrix of the ML estimator. The assessment of the sample variability of the parameter estimates in a linear Gaussian cluster-weighted model is a necessary step in the subsequent development of inference methods for the model parameters, such as asymptotic confidence intervals, tests for the significance of the effect of any covariate on a given response within any sub-population and tests for the significance of the difference between the effects of the same covariate on a given response in two different sub-populations. Thus, additional computations are necessary to obtain an assessment of the sample variability of model parameter estimates. To the best of the author's knowledge, the only solution currently available for the linear Gaussian cluster-weighted models is implemented in the flexCWM package, where approximated standard errors are only provided for the intercepts and regression coefficients according to an approach in which a number of separate linear regression analyses with random covariates are carried out (one for each component of the mixture), and the sample observations are weighted with their estimated posterior probabilities of coming from the different components of the mixture. However, this approach only partially exploits the sample information about the parameters under a linear normal cluster-weighted model. Thus, other approaches could be investigated and detected. A solution can be obtained by resorting to bootstrap methods (see, e.g., Newton & Raftery 1994;Basford et al. 1997;McLachlan & Peel 2000). However, the overall computational process associated with the use of bootstrap techniques can become particularly time-consuming and complex because of difficulties typically associated with the fitting of finite mixture models (e.g. label-switching problems, possible convergence failures of the EM algorithm on the bootstrap samples). These inconveniences could be avoided through an approach in which the observed information matrix is obtained from the incomplete data log-likelihood and employed to compute information-based estimators of the covariance matrix of the ML estimator (see e.g. McLachlan & Peel 2000). This task has been successfully carried out under normal mixture models (Boldea & Magnus, 2009) and clusterwise linear regression models (Galimberti et al., 2021).
In order to make it possible to properly assess both the variability of and the covariance between ML estimates of all the parameters under multivariate linear normal clusterweighted models with a multivariate response, the gradient vector and second-order derivative matrix of the incomplete data log-likelihood for these models are explicitly derived here. Then, these results are used to obtain three estimators of the observed information matrix and the covariance matrix of the ML estimator. Properties of such estimators are numerically investigated using simulated datasets in comparison with the parametric bootstrap and the approach implemented in flexCWM. A numerical evaluation of the relationships between the estimators introduced here and those described by Boldea and Magnus (2009) is also provided. The practical usefulness of the proposed estimators is illustrated in a study aiming at evaluating the link between tourism flows and attendance at museums and monuments in two Italian regions.
The remainder of the paper is organised as follows. Section 2 provides the definition of multivariate Gaussian linear cluster-weighted model together with some quantities employed in the derivation of the score vector and the Hessian matrix. Section 3 describes the estimators of the information matrix and the covariance matrix of the ML estimator. Sections 4 and 5 summarize the main results obtained from the analysis of simulated and real datasets, respectively. The analytical expressions of the score vector and the Hessian matrix are reported in Appendix. Technical details and additional results from the analysis of simulated datasets can be found in a separate document as supplementary materials.

Score Vector and Hessian Matrix of Gaussian Linear Cluster-Weighted Models
Let X = (X 1 , ..., X p ) and Y = (Y 1 , ..., Y q ) be two continuous random vectors with joint probability density function (p.d.f.) f (x, y). The response vector Y and the covariate vector X take values in R q and R p , respectively. Following Dang et al. (2017) where 1 , . . . , G denote the G unknown sub-populations ( g ∩ j = ∅ ∀g = j ), π g = P( g ), π g > 0 ∀g, G g=1 π g = 1, f (x| g ) is the conditional p.d.f. of X given g , f (y|x, g ) is the conditional p.d.f. of Y given X and g . A Gaussian linear clusterweighted model is obtained from Eq. 1 by additionally assuming that the distributions of X| g and Y|(X = x, g ) are Gaussian for g = 1, . . . , G and the effect of X on Y for any g is linear. By embedding all these assumptions into model (1) where φ d (·; μ, ) represents the p.d.f. of a normal d-dimensional random vector with expected value μ and positive definite covariance matrix , with B g ∈ R (1+p)×q , x * = (1, x ) , and ϑ is the vector of the unknown parameters. It has been proved that linear Gaussian cluster-weighted models of order G define the same family of probability distributions generated by mixtures of G Gaussian models for the random vector Z = (X , Y ) (Ingrassia et al., 2012). However, it is important to stress that this latter type of mixtures cannot be employed to account for local linear dependencies between X and Y.
The score vector and Hessian matrix of model (2) are derived by taking account of the fact that the weights π 1 , . . . , π G sum to one and the covariance matrices are symmetric. The first constraint is introduced in the maximization of the likelihood function by differentiating with respect to π = (π 1 , . . . , π G−1 ) and by setting π G = 1 − π 1 − · · · − π G−1 . The constraints on the covariance matrices are dealt with by using the operators vec(·), v(·) and the duplication matrix. Namely, vec(B) is the column vector obtained by stacking the columns of matrix B one underneath the other. v( ) denotes the column vector obtained from vec( ) by eliminating all supradiagonal elements of a symmetric matrix (thus, v( ) contains only the lower triangular part of ). The duplication matrix G is the unique matrix which transforms v( ) into vec( ) (Gv( ) = vec( )) (see e.g. Magnus & Neudecker 1988). Using this notation, the vector of the unknown parameters in model (2) can be denoted as ϑ = (π , θ 1 , . . . , θ G ) , where θ g = μ X g , v X g , vec B g , v Y g . Suppose that the observed data S = {(x i , y i ), i = 1, . . . , I } is composed of I independent and identically distributed observations. Then, the incomplete log-likelihood function under the model (2) is Each sample observation provides its own contribution to the gth term of the sum that defines the mixture (2). As far as the contribution of the ith observation is concerned, it is given by: By exploiting properties of the logarithm and trace, the following equality holds true: The explicit forms of the score vector and Hessian matrix, as developed here, require the introduction of some additional notation. Namely, let where e g denotes the gth column of the identity matrix of order G − 1 and 1 G−1 is the (G−1)×1 vector having each element equal to 1. The following quantities are also required: The explicit forms of the score vector S(ϑ) and Hessian matrix H (ϑ) for a Gaussian linear cluster-weighted model are provided in Theorems 1 and 2 (see Appendix). Proofs can be found in the document with the supplementary materials.

Covariance Matrix Estimation of the ML Estimator
In the ML approach, the information matrix I(ϑ) plays a crucial role, as it is used to asymptotically estimate the covariance of the ML estimator of the model parameters. Under regularity conditions and if the model is correctly specified, I(ϑ) is given either by the covariance of the score function E S(ϑ)S(ϑ) or the negative of the expected value of the Hessian matrix −E (H (ϑ)). Clearly, an analytical evaluation of the expectations required to obtain I(ϑ) under model (2) is quite complex. By exploiting some asymptotic results concerning ML estimation (White, 1982), it is possible to obtain the following asymptotic estimators of I(ϑ): where S i (θ) and H i (θ) denote the contribution of the ith sample observation to the score function and Hessian matrix evaluated at the ML estimateθ, respectively. They can be used to obtain the following asymptotic estimators of Cov(θ), the covariance matrix ofθ: Cov 2 (θ) = I −1 2 .
Under suitable conditions (see e.g. Newey & McFadden 1994;Ritter 2015, for a general discussion and some results specifically dealing with finite mixture models, respectively), Cov 1 (θ) and Cov 2 (θ) can be considered consistent estimators of Cov(θ) when the model is correctly specified. By exploiting the so-called sandwich approach (see e.g. White 1980), the following robust estimator can also be employed: It is worth noting that for the existence of the estimators Cov 2 (θ) and Cov 3 (θ) it is required that matrix I 2 is positive definite and well conditioned. The same requirements have to be fulfilled by matrix I 1 in order to guarantee that Cov 1 (θ) exists. With large-scale covariance matrices and small sample sizes, I 1 and/or I 2 could be ill-conditioned. These situations can be managed by resorting to procedures able to produce improved estimators of I(ϑ) from either I 1 or I 2 . For example, the algorithm by Higham (1988) computes the nearest positive definite matrix of a given symmetric matrix by adjusting its eigenvalues. Other approaches which exploit techniques of variance reduction could also be adopted (see e.g. Schäfer and Strimmer 2005).

Numerical Study of the Properties of the Proposed Estimators
In order to evaluate the properties of Cov 1 (θ), Cov 2 (θ) and Cov 3 (θ) in comparison with the estimators based on the parametric bootstrap and the approach implemented in flexCWM, five Monte Carlo studies have been performed. In the first study, the artificial datasets have been generated under a model defined by Eqs. 2-3 where G = 2, q = 1 and p = 2. As far as the model parameters are concerned, the following values have been employed: π 1 = 0.7, π 2 = 0.3, Y 1 = 1.5, Y 2 = 1, B 1 = (5, 2, 2), B 2 = (1, −2, −2), μ X 1 = (−2, −2), μ X 2 = (2, 2), X 1 = 1.0 0.2 0.2 1.0 , X 2 = 1.0 0.4 0.4 1.0 . Such values have been chosen so as to produce two well-separated groups of observations (see the upper panel of Fig. 1, with the pairwise scatterplots of the variables X 1 , X 2 and Y 1 for a sample of size I = 500 generated as just described). In this way, problems of label switching across simulations are less likely to occur. Furthermore, the ML estimates of ϑ are expected to be accurate enough to let the analysis be focused on the different ways of estimating the standard error ofθ. Using these parameter values, R = 10, 000 datasets (of size I ) have been generated as follows: 1. For the rth dataset (r = 1, . . . , R), a sample of size I is drawn from the standard p-dimensional normal distribution; this gives the vectors 1r , . . . , I r ; 2. For the rth dataset (r = 1, . . . , R), a sample of size I is drawn from the standard q-dimensional normal distribution; this gives the vectors η 1r , . . . , η I r ; 3. For the rth dataset (r = 1, . . . , R), a sample of size I is drawn from the Bernoulli distribution with parameter π 1 ; this produces the 0-1 vector z r = (z 1r , . . . , z I r ) ; 4. For the ith observation (i = 1, . . . , I ) of the rth dataset, vectors x ir and y ir are obtained as follows: where A X g and A Y g are matrices obtained from the spectral decompositions of X g and Y g , respectively. Such matrices are constructed such that In the second study, the datasets have been obtained through the same procedure used in the first one except from the computation of vectors ir and η ir . Namely, a sample of size I ·p is drawn from the uniform distribution in the interval (0,1) for the rth dataset (r = 1, . . . , R); this produces a vector * r , whose elements are transformed as follows: jr = √ 12( * jr − 0.5), j = 1, . . . , I · p; the vector r resulting from this transformation has zero mean and unit variance; partitioning r into I p−dimensional vectors leads to 1r , . . . , I r . The same process has been applied to obtain vectors η 1r , . . . , η I r . The second panel of Fig. 1 provides the pairwise scatterplots of X 1 , X 2 and Y 1 for a sample of size I = 500 from this second study.
In the third and fourth studies, the datasets have been generated as in the first and second studies, respectively, but using the following model parameters: μ X 1 = (−1, −1) , μ X 2 = (1, 1) . This change in the values of μ X 1 and μ X 2 leads to overlapping groups of observations (see the pairwise scatterplots of X 1 , X 2 and Y 1 for samples of size I = 500 in third and fourth panels of Fig. 1). The total number of model parameters in the first four studies is 19.
The fifth study has been carried out with the same settings of the first study but with p = 8 explanatory variables. The model parameters pertaining to X which have been employed to generate the datasets are as follows: As far as the effects of the regressors on Y 1 are concerned, they have been set as follows: In this latter study, the total number of model parameters is 109.
In all studies, Monte Carlo experiments have been performed with two different sample sizes: I = 250, 500 in the first four studies, I = 300, 500 in the last study. In all experiments, it has been assumed that the rth dataset {(x 1r , y 1r ), . . . , (x I r , y I r )} is generated from a model defined by Eqs. 2-3 with G = 2. Thus, the maximum likelihood estimateθ r of ϑ has been computed for r = 1, . . . , R under such an assumption. Parameter estimation has been carried out through the EM algorithm as implemented in the function cwm of the package flexCWM. As far as the initialisation of the parameters is concerned, an option has been employed, which executes 5 independent runs of the k-means algorithm and picks the solution maximising the observed-data log-likelihood among these runs. The maximum number of iterations of the EM algorithm has been set equal to 400. A convergence criterion based on the Aitken acceleration has been used, with a threshold = 10 −6 (for further details, see Mazza et al. 2018).
The R independent estimates of ϑ are used to approximate the true distribution ofθ and, in particular, the true standard errors of all the elements ofθ. Estimates of these standard errors have been computed using the three information-based estimators and the parametric bootstrap for R 1 = 2000 datasets obtained as described above. For each bootstrap estimate, 100 bootstrap samples have been employed. For the ML estimates of the model intercepts and regression coefficients, the standard errors estimated by the function cwm of the package flexCWM using the approach illustrated in the introduction have been included in the comparison. The performances of these strategies have been evaluated on the basis of an estimate of their biases and root mean squared errors (RMSE). A comparative evaluation of such approaches has been carried out also through the coverage probabilities (CP) of 90% and 95% confidence intervals based on the examined standard errors' estimates and the standard normal quantiles. In this latter comparison, the attention is focused on the expected mean values of the regressors (i.e. μ X 1 and μ X 2 ) and the regression coefficients (all the entries in the first column of B 1 and B 2 except the first one).
Tables 1, 2, 3 and 4 contain the biases and RMSEs for the first four Monte Carlo studies with samples of size 250. The same information for the last study and the sample size I = 300 can be found in Table 5. The corresponding values for the CPs are summarised in Tables 6, 7, 8, 9 and 10. In all the tables, the results obtained using the function cwm of the package flexCWM, the bootstrap and the estimators defined in Eqs. 13-15 are denoted as cwm, Boot, C 1 , C 2 and C 3 , respectively. From now on, B g [j, k] is used to denote the element on the j -th row and k-th column of matrix B g ; μ g [j ] represents the j -th element Table 1 Biases and root mean squared errors of the examined standard error estimators ofθ in the first study, I All entries have been multiplied by 100 Table 2 Biases and root mean squared errors of the examined standard error estimators ofθ in the second study, I All entries have been multiplied by 100 Table 3 Biases and root mean squared errors of the examined standard error estimators ofθ in the third study, I

0.926
Entries in italics denote effective coverage probabilities significantly different from the corresponding nominal ones (two-tailed normal tests, α = 0.00125)

Table 10
Coverage probability of the 90% and 95% confidence intervals for μ Xg Table 5 provides the mean values of the biases and RMSEs of the ML estimates over the elements of each of the following vectors of model parameters: μ X g , v( X g ), B g [−1, 1] for g = 1, 2, where B g [−1, 1] is the vector obtained by dropping the first element from the first column of B g . Thus, B g [−1, 1] comprises the regression coefficients of the p covariates on Y 1 given g . In a similar way, Table 10 contains the mean values of the CPs of 90% and 95% confidence intervals over μ X g and B g [−1, 1] for g = 1, 2.
Under the experimental conditions considered in the first Monte Carlo study, biases are generally small for all the estimated standard errors (see Table 1). The overall best performance in terms of accuracy seems to be achieved by means of the estimator Cov 2 (θ). The bootstrap approach appears to provide the most precise estimates of the standard errors of Y g . The sandwich method is slightly more accurate than the bootstrap approach in estimating the standard errors of the ML estimates of the expected values of the regressors; the opposite result holds true when dealing with the estimation of the standard errors ofˆ X g . The highest root mean square errors are mostly obtained using either the function cwm of the package flexCWM or the estimator Cov 1 (θ), which are therefore not recommended. These results confirm both the best performance of an estimator based on the Hessian and the poor performance of an estimator based on the gradient vector under correctly specified models registered in a study dealing with multivariate normal mixture models (Boldea & Magnus, 2009). It is also worth noting that the accuracy of the approach implemented in flexCWM sharply deteriorates when the ML estimates of the intercept and regression coefficients of the second group (B 2 ) are considered. As far as the effective confidence levels for the parameters μ X g , B g [2, 1] and B g [3, 1] are concerned (Table 6), the obtained results are generally similar to one another and quite close to the nominal confidence levels for all the examined methods except the one implemented in flexCWM. With this latter method, the effective confidence levels for the regression coefficients clearly deviate from the nominal ones, especially in the second group of observations. All these results have been employed to run tests for the hypotheses of equality between effective and nominal confidence levels. This task has been carried out through asymptotic two-tailed normal tests for a proportion at a 0.00125 significance level (the Bonferroni correction 0.01/8 has been adopted to account for multiple tests performed for each estimation method and each nominal confidence level). All the effective CPs of the confidence intervals for the model regression coefficients obtained using both the estimator based on the gradient and the approach implemented in flexCWM appear to be significantly different from the corresponding nominal ones (see the entries in italics in Table 6). As far as the results from the bootstrap-based and Hessian-based estimators are concerned, the null hypothesis of equality between effective and nominal confidence levels should be rejected for two regression coefficients at both examined confidence levels; the same null hypothesis has to be rejected for only one regression coefficient when using the the sandwich approach.
In the second Monte Carlo study, a substantial increase in the biases of the estimated standard errors ofˆ X g andˆ Y g has been registered with all the examined estimators except for the sandwich method (Table 2); this latter method is also the most accurate. Using a Gaussian cluster-weighted model for the analysis of datasets generated under a uniform cluster-weighted model seems to have a little impact on the confidence intervals for both the expected values of the regressors and the regression coefficients (Table 7).
When the data are obtained from two overlapping groups of observations drawn from Gaussian distributions (third study), the resulting biases and RMSEs (see Table 3) are quite similar to the ones from the first study; the main effect of the reduction in the separation of the two groups is a general slight increase in the RMSEs with all the examined methods. The best accuracy is still achieved by the estimator Cov 2 (θ) for all model parameters, with the exception ofˆ Y g , whose standard errors are more accurately estimated by the bootstrap approach. As far as the effect of this reduction on the effective CPs of the confidence intervals is concerned (Table 8), the most remarkable result is an increase in the gap between nominal and effective CPs associated with the use of the approach implemented in flexCWM for the regression coefficients in the second group of observations. When the two overlapping groups of observations are generated from the uniform distribution (fourth study), both biases and RMSEs ofB 1 results remarkably increased with all the examined methods (see Table 4). However, it is worth noting that the lowest of such increases has always been associated with the use of Cov 3 (θ), which is also the estimator with the best accuracy for the ML estimate of variances and covariances of the regressors in both groups and the majority of the model parameter estimates. Furthermore, the sandwich estimator shows the best performance in terms of effective CPs that are not significantly different from the nominal ones (Table 9).
In the presence of datasets generated under a Gaussian cluster-weighted model with p = 8 regressors (fifth study), the most remarkable effects of a larger number of covariates on the performance of the examined estimators with samples of size I = 300 appear to be (see Table 5 in comparison with Table 1) a sharp decrease in the accuracy of the estimates of the standard errors produced by the method based on the gradient for all the parameters of the second group of observations and a deterioration in the performances of the bootstrap approach in reference to all the parameters of the conditional distribution of Y given X and g , especially B g [1, 1], for g = 1, 2. As far as the methods Cov 2 (θ) and Cov 3 (θ) are concerned, biases and RMSEs result to be quite similar to the ones from the first study for all parameter estimates except the intercepts for both groups. Thus, the best overall accuracy is still achieved by the estimator Cov 2 (θ).
The results from the five studies with samples of size 500 (see Tables A-J in the separate document with the supplementary materials) are generally in line with those just described. It is worth noting that using a larger sample size leads to a reduction in the RMSEs for all the examined estimators. With datasets containing two separated groups of observations (first and second studies), all the effective CPs of the confidence intervals obtained using the sandwich approach appear to be not significantly different from the corresponding nominal ones. When overlapping groups are considered (third and fourth studies), the estimator Cov 3 (θ) has produced confidence intervals whose effective levels are the closest to the nominal ones. Thus, overall, the obtained results show the robustness of the sandwich method.

A Comparison with Some Estimators Under Normal Mixtures
As already mentioned in the "Introduction", three information-based estimators of the covariance matrix of the ML estimator for finite normal mixture models were developed by Boldea and Magnus (2009): two of them are based on the gradient vector and the Hessian matrix of the incomplete log-likelihood under a normal mixture model; the third estimator exploits the sandwich approach. From now on, these three estimators will be denoted as BM 1 , BM 2 and BM 3 , respectively. Furthermore, it has been already highlighted that finite mixtures of Gaussian distributions and linear Gaussian cluster-weighted models define the same family of probability distributions (Ingrassia et al., 2012). Thus, it could be interesting to obtain an evaluation of the relationships between the estimators described in Section 3 and the estimators developed by Boldea and Magnus (2009). This task has been numerically performed by means of two additional simulation studies. The model employed to generate the simulated datasets is: where p = q = 1, G = 2, ψ = (π 1 , ψ 1 , ψ 2 ) , ψ 1 = (μ 1 , v( 1 ) ) , ψ 1 = (μ 2 , v( 2 ) ) , π 1 = 0.7, π 2 = 0.3, μ 1 = (0, 0), μ 2 = ( , ), 1 = 1.0 0.0 0.0 1.0 , 2 = 2.0 1.0 1.0 2.0 . The two studies have been carried out using 5 and 10 as values of so as to obtain two different levels of separation between μ 1 and μ 2 . In each study, 100 datasets have been generated for each of two sample sizes: I = 100, 250. The R packages mclust (Scrucca et al., 2016) and flexCWM have been employed to compute the ML estimates of ψ in model (16) with G = 2 and the ML estimates of ϑ in model (2) with G = 2, respectively. Furthermore, the standard errors ofθ have been estimated using Eqs. 13-15. As far asψ is concerned, estimated standard errors have been computed according to the solutions described in Boldea and Magnus (2009).
Models (16) and (2) are characterised by the same value of π 1 . Furthermore, as illustrated by Ingrassia et al. (2012), some elements in ϑ coincide with some elements in ψ; namely, μ Thus, the comparison between the estimators described in Section 3 and the estimators developed by Boldea and Magnus (2009) has been focused on the following two subvectors of model parameters: ) be the standard error of the j -th element ofθ computed using the estimator C m and the r-th dataset. Furthermore, let se m (ψ r [j ]) be the standard error of the j -th element ofψ obtained from the estimator BM m . In order to compare such estimated standard errors, the following differences have been computed: d rm (j ) = se m (ψ r [j ]) − se m (θ r [j ]), j = 1, . . . , 5, m = 1, 2, 3, r = 1, . . . , 100. The results obtained for = 5 and = 10 with samples of size I = 100 are graphically represented in Figs. 2 and 3, respectively. The distributions of the differences d rm (j ) for almost all the examined parameters appear to be centred around 0 and highly homogeneous, thus highlighting a general equivalence between the standard errors resulting from the two models. This result holds true especially for the estimators based on the Hessian matrix and the sandwich approach (m = 2, 3) when the separation between the two groups is larger ( = 10), and for the estimator based on the gradient vector (m = 1) when the separation is low ( = 5). However, it is also worth noting that with both levels of separation the differences in the standard errors ofπ 1 computed using the two estimators based on the gradient vector show a median value slightly below 0 and a distribution with negative skewness. Similar results have been obtained with samples of size I = 250 (see Figures A and B in the supplementary material).

Analysing Regional Tourism Data in Italy
Similar to other studies (see e.g. Cellini & Cuccia 2013), the analysis summarised here aims at evaluating the link between tourism flows and attendance at museums and monuments, with a focus on two Italian regions: Emilia Romagna (ER) and Veneto (Ve). For both regions, three variables have been examined: tourist arrivals (denoted Arriv), tourist overnights (Overn) and visits to State museums, monuments and museum networks (Visit). Measurements for such variables are available with a monthly frequency over the period January 1999 to December 2017. The source for Visit is the website of the Italian Ministry of Cultural Heritage (http://www.statistica.beniculturali.it). Data on Arriv and  Overn have been obtained from the websites of Emilia-Romagna 1 and Veneto 2 regional governments. Overall, the analysed dataset is composed of I = 228 monthly observations for six variables. Because of the goal of the analysis, these variables have been partitioned as follows: Y = (Visit ER, Visit Ve) , X = (Arriv ER, Overn ER, Arriv Ve, Overn Ve) . The analysed data are expressed in thousands. Models obtained from Eqs. 2-3 have been fitted to the dataset for G from 1 to 8. To this end, a function written for the R software environment which implements an EM algorithm for the ML estimation of multivariate Gaussian cluster-weighted linear models has been employed. In order to prevent problems due to the presence of singular or nearly singular matrices during the iterations, all covariance matrices have been required to have eigenvalues greater than 10 −20 ; furthermore, the ratio between the smallest and the largest eigenvalues of such matrices is required to be not lower than 10 −10 . Model parameters are initialised according to a two-step strategy. In the first step, the joint distribution of the covariates and responses is estimated using a mixture of G Gaussian models through the mclust package. This produces the required starting values for the G weights, mean vectors and covariance matrices for the predictors. In the second step, the initialisation of B g and Y g is obtained from the fitting of a Gaussian linear regression model to the sample of observations that have been assigned to the gth component of the mixture model estimated in the first step. The R package systemfit (Henningsen & Hamann, 2007) has been exploited to perform this task. The maximum number of iterations of the EM algorithm has been set equal to 500. A convergence criterion based on the Aitken acceleration has been used, with a threshold = 10 −8 . Table 11 shows the values of the maximised log-likelihood (l(θ)) and the Bayesian information criterion (BIC) (Schwarz, 1978) for the eight fitted models, where BI C = 2l(θ) − npar ln(I ), and npar denotes the number of model parameters. According to this criterion, the model with the best trade-off between the fit and complexity seems to be the one with G = 5 clusters of months. With this model, there is a perfect correspondence between some clusters and some months (see Table 12): cluster 2 only contains observations in April and May; cluster 4 only contains observations in June, July and August; cluster 5 only contains observations in January, February, November and December. As far as the remaining months are concerned, observations in September for the years 2005-2017 have been assigned to cluster 1; cluster 3 comprises the remaining observations in September together with all the observations in March and October. The obtained cluster structure   [1999][2000][2001][2002][2003][2004] and cluster 4 (from November to February) (see theμ g [j ] in Table 13). In all clusters, Veneto is characterised by mean values of both regressors which are always higher than those of Emilia-Romagna. During the examined period of time, there seems to be heterogeneity also on the effects of the tourist arrivals and overnights on the number of visits in both regions (see theB g [j, k] in Table 13). Such effects do not result to be always positive. Furthermore, an increase in the tourist arrivals and overnights in one region does not necessarily have a positive impact on the number of visits to State museums, monuments and museum networks of the other region.
Estimates of the standard errors for the parameter estimates of the selected clusterweighted model have been computed by the boostrap approach, using 100 bootstrap samples generated from the selected model. Furthermore, Cov(θ) has been estimated by resorting to Eqs. 13-15. The algorithm by Higham (1988), as implemented in the R package corpcor,    According to the estimator based on I 1 , five additional regression coefficients (three within cluster 4, two within cluster 5) result to be significantly different from zero; the same conclusion is obtained from the use of Cov 2 (θ) and Cov 3 (θ). The results obtained using the three methods illustrated in Section 3 suggest that tourist arrivals in Emilia Romagna have a positive significant effect on the visits to State museums, monuments and museum networks both in Emilia Romagna and in Veneto in June, July and August. Arriv ER has a significant and negative effect on Visit Ve in January, February, November and December. Furthermore, Arriv Ve seems to significantly affect Visit Ve, with a positive effect in January, February, November and December and a negative effect in June, July and August.
The tests based on the estimators Cov 2 (θ) and Cov 3 (θ) lead to a rejection of H 0 for further five regression coefficients, three of which concern the effect of tourist arrivals and overnights in Veneto in September for the years 2005-2017 (cluster 1). As far as cluster 2 is concerned, all regression coefficients seem to be not significantly different from 0 according to all approaches except for the effect of Overn Ve on Visit Ve, which results to be significant only when standard errors are estimated from the Hessian matrix. By exploiting the results contained in the non-diagonal elements of matrices Cov 1 (θ), Cov 2 (θ) and Cov 3 (θ), it is also possible to run tests for the significance of the difference between the effects of two different predictors on the same response in a given cluster or tests for the significance of the difference between the effects of the same predictor on the same response in two different clusters. For any given pair of regression coefficients in the model, the null hypothesis can be expressed as follows: H 0 : B g 1 [j 1 , k 1 ] = B g 2 [j 2 , k 2 ].

3
(3, 2, 1) ( 5, 2, 1) 0.2757 For each of these illustrations, Table 17 summarises the results of the z tests run using the three estimated covariance matrices (the non-diagonal elements are not reported). For the first two examples, the compared methods lead to results which are all in favour of the null hypothesis. In the third illustration, the null hypothesis should be rejected (α = 0.05) when variances and covariances are estimated using both the Hessian and sandwich approaches.

Conclusions
Three information-based estimators of the asymptotic covariance matrix of the ML estimator under multivariate Gaussian linear cluster-weighted models have been illustrated. For their computation, formulae for the score vector and Hessian matrix of the incomplete log-likelihood have been derived. Properties of these estimators have been numerically evaluated using simulated samples in comparison with the parametric bootstrap-based estimator. For the ML estimates of the model intercepts and regression coefficients, the comparison has included an approach implemented in the package flexCWM in which estimated standard errors are computed by fitting G separate linear weighted regression models using the estimated posterior probabilities as weights. With correctly specified models, the most accurate estimator of the standard error of the ML estimator is the one based on the Hessian matrix. When Gaussian cluster-weighted models are fitted to datasets generated from a uniform distribution, the best accuracy is achieved with the sandwich estimator. Overall, the obtained results show the robustness of this latter method. Through these informationbased estimators, the tasks of computing approximated confidence intervals and running tests concerning pairs of parameters can be easily carried out, as illustrated through a study aiming at evaluating the link between tourism flows and attendance at museums and monuments in two Italian regions. Asymptotic properties of the estimators introduced here could also be studied from a theoretical point of view. For example, suitable regularity conditions can be defined so as to provide a general assessment of their consistency (see for example Galimberti et al. (2020) for a similar study in the context of clusterwise regression models).