Estimating the Covariance Matrix of the Maximum Likelihood Estimator Under Linear Cluster-Weighted Models

Soffritti, Gabriele

doi:10.1007/s00357-021-09390-9

Estimating the Covariance Matrix of the Maximum Likelihood Estimator Under Linear Cluster-Weighted Models

Open access
Published: 31 July 2021

Volume 38, pages 594–625, (2021)
Cite this article

Download PDF

You have full access to this open access article

Journal of Classification Aims and scope Submit manuscript

Estimating the Covariance Matrix of the Maximum Likelihood Estimator Under Linear Cluster-Weighted Models

Download PDF

Gabriele Soffritti ORCID: orcid.org/0000-0002-7575-892X¹

3563 Accesses
3 Citations
Explore all metrics

Abstract

In recent years, the research into cluster-weighted models has been intense. However, estimating the covariance matrix of the maximum likelihood estimator under a cluster-weighted model is still an open issue. Here, an approach is developed in which information-based estimators of such a covariance matrix are obtained from the incomplete data log-likelihood of the multivariate Gaussian linear cluster-weighted model. To this end, analytical expressions for the score vector and Hessian matrix are provided. Three estimators of the asymptotic covariance matrix of the maximum likelihood estimator, based on the score vector and Hessian matrix, are introduced. The performances of these estimators are numerically evaluated using simulated datasets in comparison with a bootstrap-based estimator; their usefulness is illustrated through a study aiming at evaluating the link between tourism flows and attendance at museums and monuments in two Italian regions.

Empirical Likelihood-Based Inference for the Difference of Two Location Parameters Using Smoothed M-Estimators

Article 14 February 2019

A Comparative Study of Robust and Stable Estimates of Multivariate Location

Article 21 February 2019

Covariance matrix estimation of the maximum likelihood estimator in multivariate clusterwise linear regression

Article 18 May 2020

Use our pre-submission checklist

Avoid common mistakes on your manuscript.

1 Introduction

Cluster-weighted models constitute an approach to regression analysis with random covariates in which supervised (regression) and unsupervised (model-based cluster analysis) learning methods are jointly exploited (Hastie et al., 2009). In this approach, a given random vector is assumed to be composed of an outcome Y (response, dependent variable) and its explanatory variables X (covariates, predictors). Furthermore, sample observations are allowed to come from a population composed of several unknown sub-populations. Finally, the joint distribution of the outcome and the covariates is modelled through a finite mixture model specified so as to account for a different effect of the covariates on the response within each sub-population. Thus, cluster-weighted models are useful to perform model-based cluster analysis in a multivariate regression setting with random covariates and unobserved heterogeneity.

Since the introduction of this approach (Gershenfeld, 1997), the research into cluster-weighted models has been intense, especially in the last 8 years. Models for continuous variables under normal mixture models have been proposed by Ingrassia et al. (2012) and Dang et al. (2017). Robustified solutions have been developed by Ingrassia et al. (2014) and Punzo and McNicholas (2017), based on the use of Student t and contaminated normal mixture distributions, respectively. Punzo and Ingrassia (2013), Punzo and Ingrassia (2016), Ingrassia et al. (2015) and Di Mari et al. (2020) have introduced models for various types of responses. Models able to deal with non-linear relationships or many covariates have been proposed by Punzo (2014, Subedi et al. (2013) and Subedi et al. (2015).

By focusing the attention on Gaussian cluster-weighted models in which the effects of the covariates on the response within each sub-population are linear, model parameters are generally estimated through the maximum likelihood (ML) method by resorting to the expectation-maximisation (EM) algorithm (Dempster et al., 1977), which is widely employed in incomplete-data problems. In these models, the observed data $\mathcal {S}=\{(\mathbf {x}_{i},\mathbf {y}_{i}$), i = 1, … , I} are incomplete because the specific component density that generates the I sample observations is missing. This missing information is modelled through an unobserved variable coming from a pre-specified multinomial distribution and is added to the observed data so as to form the so-called complete data. Then, the ML estimate is computed from the maximisation of the complete data log-likelihood. A description of the EM algorithm for the linear Gaussian cluster-weighted model can be found in Dang et al. (2017). Specific functions implementing such algorithm for models with a univariate response are available in the package flexCWM (Mazza et al., 2018) for the R software environment (R Core Team, 2020).

A by-product of this estimating approach is a set of K estimated posterior probabilities that each sample observation comes from the K Gaussian distributions of the mixture. Thus, a by-product of fitting a linear Gaussian cluster-weighted model is a clustering of the I sample observations, based on a rule that assigns an observation to the distribution of the mixture to which it has the highest posterior probability of belonging. However, an estimating approach based on the use of an EM algorithm has the drawback of not automatically producing any estimate of the covariance matrix of the ML estimator. The assessment of the sample variability of the parameter estimates in a linear Gaussian cluster-weighted model is a necessary step in the subsequent development of inference methods for the model parameters, such as asymptotic confidence intervals, tests for the significance of the effect of any covariate on a given response within any sub-population and tests for the significance of the difference between the effects of the same covariate on a given response in two different sub-populations. Thus, additional computations are necessary to obtain an assessment of the sample variability of model parameter estimates. To the best of the author’s knowledge, the only solution currently available for the linear Gaussian cluster-weighted models is implemented in the flexCWM package, where approximated standard errors are only provided for the intercepts and regression coefficients according to an approach in which a number of separate linear regression analyses with random covariates are carried out (one for each component of the mixture), and the sample observations are weighted with their estimated posterior probabilities of coming from the different components of the mixture. However, this approach only partially exploits the sample information about the parameters under a linear normal cluster-weighted model. Thus, other approaches could be investigated and detected. A solution can be obtained by resorting to bootstrap methods (see, e.g., Newton and Raftery 1994; Basford et al., 1997; McLachlan & Peel 2000). However, the overall computational process associated with the use of bootstrap techniques can become particularly time-consuming and complex because of difficulties typically associated with the fitting of finite mixture models (e.g. label-switching problems, possible convergence failures of the EM algorithm on the bootstrap samples). These inconveniences could be avoided through an approach in which the observed information matrix is obtained from the incomplete data log-likelihood and employed to compute information-based estimators of the covariance matrix of the ML estimator (see e.g. McLachlan & Peel 2000). This task has been successfully carried out under normal mixture models (Boldea & Magnus, 2009) and clusterwise linear regression models (Galimberti et al., 2021).

In order to make it possible to properly assess both the variability of and the covariance between ML estimates of all the parameters under multivariate linear normal cluster-weighted models with a multivariate response, the gradient vector and second-order derivative matrix of the incomplete data log-likelihood for these models are explicitly derived here. Then, these results are used to obtain three estimators of the observed information matrix and the covariance matrix of the ML estimator. Properties of such estimators are numerically investigated using simulated datasets in comparison with the parametric bootstrap and the approach implemented in flexCWM. A numerical evaluation of the relationships between the estimators introduced here and those described by Boldea and Magnus (2009) is also provided. The practical usefulness of the proposed estimators is illustrated in a study aiming at evaluating the link between tourism flows and attendance at museums and monuments in two Italian regions.

The remainder of the paper is organised as follows. Section 2 provides the definition of multivariate Gaussian linear cluster-weighted model together with some quantities employed in the derivation of the score vector and the Hessian matrix. Section 3 describes the estimators of the information matrix and the covariance matrix of the ML estimator. Sections 4 and 5 summarize the main results obtained from the analysis of simulated and real datasets, respectively. The analytical expressions of the score vector and the Hessian matrix are reported in Appendix. Technical details and additional results from the analysis of simulated datasets can be found in a separate document as supplementary materials.

2 Score Vector and Hessian Matrix of Gaussian Linear Cluster-Weighted Models

Let $\textbf {X}=(X_{1},...,X_{p})^{\prime }$ and $\textbf {Y}=(Y_{1},...,Y_{q})^{\prime }$ be two continuous random vectors with joint probability density function (p.d.f.) f(x,y). The response vector Y and the covariate vector X take values in $\mathbb {R}^{q}$ and $\mathbb {R}^{p}$, respectively. Following Dang et al. (2017), $(\textbf {X}^{\prime },\textbf {Y}^{\prime })^{\prime }$ follows a cluster-weighted model of order G if

$$ f(\textbf{x},\textbf{y})=\sum\limits_{g=1}^{G} \pi_{g} f(\textbf{x}|\mathbf{\Omega}_{g}) f(\textbf{y}|\textbf{x},\mathbf{\Omega}_{g}), $$

(1)

where Ω₁, … , Ω_G denote the G unknown sub-populations (Ω_g ∩ Ω_j = ∅ ∀g≠j), $\pi _{g}=\mathbb {P}(\mathbf {\Omega }_{g})$, π_g > 0 ∀g, ${\sum }_{g=1}^{G}\pi _{g}=1$, f(x|Ω_g) is the conditional p.d.f. of X given Ω_g, f(y|x,Ω_g) is the conditional p.d.f. of Y given X and Ω_g. A Gaussian linear cluster-weighted model is obtained from Eq. 1 by additionally assuming that the distributions of X|Ω_g and Y|(X = x,Ω_g) are Gaussian for g = 1, … , G and the effect of X on Y for any Ω_g is linear. By embedding all these assumptions into model (1), f(x,y) becomes

$$ f(\textbf{x},\textbf{y}; \boldsymbol{\vartheta})=\sum\limits_{g=1}^{G} \pi_{g} \phi_{p}(\textbf{x};\boldsymbol{\mu}_{\textbf{X}_{g}},\boldsymbol{\Sigma}_{\textbf{X}_{g}}) \phi_{q}(\textbf{y}|\textbf{x};\textbf{B}^{\prime}_{g}\textbf{x}^{*},\boldsymbol{\Sigma}_{\textbf{Y}_{g}}), $$

(2)

where ϕ_d(⋅;μ,Σ) represents the p.d.f. of a normal d-dimensional random vector with expected value μ and positive definite covariance matrix Σ,

$$ \textbf{B}^{\prime}_{g}\textbf{x}^{*}=\mathbb{E}(\textbf{Y}|\textbf{X}=\textbf{x},\mathbf{\Omega}_{g}), \ g=1, \ldots, G, $$

(3)

with $\textbf {B}_{g}\in \mathbb {R}^{(1+p)\times q}$, $\textbf {x}^{*}=(1,\textbf {x}^{\prime })^{\prime }$, and 𝜗 is the vector of the unknown parameters. It has been proved that linear Gaussian cluster-weighted models of order G define the same family of probability distributions generated by mixtures of G Gaussian models for the random vector $\textbf {Z}=(\textbf {X}^{\prime },\textbf {Y}^{\prime })^{\prime }$ (Ingrassia et al., 2012). However, it is important to stress that this latter type of mixtures cannot be employed to account for local linear dependencies between X and Y.

The score vector and Hessian matrix of model (2) are derived by taking account of the fact that the weights π₁, … , π_G sum to one and the covariance matrices are symmetric. The first constraint is introduced in the maximization of the likelihood function by differentiating with respect to $\boldsymbol {\pi }=\left (\pi _{1},\dots ,\pi _{G-1}\right )^{\prime }$ and by setting π_G = 1 − π₁ −⋯ − π_G− 1. The constraints on the covariance matrices are dealt with by using the operators vec(⋅), v(⋅) and the duplication matrix. Namely, vec(B) is the column vector obtained by stacking the columns of matrix B one underneath the other. v(Σ) denotes the column vector obtained from vec(Σ) by eliminating all supradiagonal elements of a symmetric matrix Σ (thus, v(Σ) contains only the lower triangular part of Σ). The duplication matrix G is the unique matrix which transforms v(Σ) into vec(Σ) (G v(Σ) = vec(Σ)) (see e.g. Magnus and Neudecker 1988). Using this notation, the vector of the unknown parameters in model (2) can be denoted as $\boldsymbol {\vartheta }= (\boldsymbol {\pi }^{\prime },\boldsymbol {\theta }^{\prime }_{1}, \ldots , \boldsymbol {\theta }^{\prime }_{G})^{\prime }$, where $\boldsymbol {\theta }_{g}=\left (\boldsymbol {\mu }^{\prime }_{\textbf {X}_{g}}, \mathrm {v}\left (\boldsymbol {\Sigma }_{\textbf {X}_{g}}\right )^{\prime }, \text {vec}\left (\textbf {B}_{g}\right )^{\prime }, \mathrm {v}\left (\boldsymbol {\Sigma }_{\textbf {Y}_{g}}\right )^{\prime }\right )^{\prime }$.

Suppose that the observed data $\mathcal {S}=\{(\mathbf {x}_{i},\mathbf {y}_{i}$), i = 1, … , I} is composed of I independent and identically distributed observations. Then, the incomplete log-likelihood function under the model (2) is

$$ l(\boldsymbol{\vartheta})=\sum\limits_{i=1}^{I}\ln \biggl(\sum\limits_{g=1}^{G}\pi_{g}\phi_{p}\bigl(\textbf{x}_{i};\boldsymbol{\mu}_{\textbf{X}_{g}},\boldsymbol{\Sigma}_{\textbf{X}_{g}}\bigr)\phi_{q}\bigl(\textbf{y}_{i}|\textbf{x}_{i};\textbf{B}^{\prime}_{g}\textbf{x}^{*}_{i},\boldsymbol{\Sigma}_{\textbf{Y}_{g}}\bigr)\biggr). $$

(4)

Each sample observation provides its own contribution to the g th term of the sum that defines the mixture (2). As far as the contribution of the i th observation is concerned, it is given by:

$$ \begin{array}{@{}rcl@{}} p_{gi}&=&\pi_{g} (2\pi)^{-\frac{p}{2}} \det(\boldsymbol{\Sigma}_{\textbf{X}_{g}})^{-\frac{1}{2}}\exp\biggl[-{1 \over 2}(\textbf{x}_{i}-\boldsymbol{\mu}_{\textbf{X}_{g}})^{\prime}\boldsymbol{\Sigma}^{-1}_{\textbf{X}_{g}}(\textbf{x}_{i}-\boldsymbol{\mu}_{\textbf{X}_{g}})\biggr] \\ & & (2\pi)^{-\frac{q}{2}} \det(\boldsymbol{\Sigma}_{\textbf{Y}_{g}})^{-\frac{1}{2}}\exp\biggl[-{1 \over 2}(\textbf{y}_{i}-\textbf{B}^{\prime}_{g}\textbf{x}^{*}_{i})^{\prime}\boldsymbol{\Sigma}^{-1}_{\textbf{Y}_{g}}(\textbf{y}_{i}-\textbf{B}^{\prime}_{g}\textbf{x}^{*}_{i})\biggr]. \end{array} $$

(5)

By exploiting properties of the logarithm and trace, the following equality holds true:

$$ \begin{array}{@{}rcl@{}} \ln p_{gi}&=& \ln \pi_{g} -{p \over 2}\ln (2\pi)-{1 \over 2}\text{ln det}(\boldsymbol{\Sigma}_{\textbf{X}_{g}})-{1 \over 2}\text{tr}\biggl[\boldsymbol{\Sigma}^{-1}_{\textbf{X}_{g}}(\textbf{x}_{i}-\boldsymbol{\mu}_{\textbf{X}_{g}})(\textbf{x}_{i}-\boldsymbol{\mu}_{\textbf{X}_{g}})^{\prime}\biggr] +\\ && -{q \over 2}\ln (2\pi)-{1 \over 2}\text{ln det}(\boldsymbol{\Sigma}_{\textbf{Y}_{g}})-{1 \over 2}\text{tr}\biggl[\boldsymbol{\Sigma}^{-1}_{\textbf{Y}_{g}}(\textbf{y}_{i}-\textbf{B}^{\prime}_{g}\textbf{x}^{*}_{i})(\textbf{y}_{i}-\textbf{B}^{\prime}_{g}\textbf{x}^{*}_{i})^{\prime}\biggr]. \end{array} $$

(6)

The explicit forms of the score vector and Hessian matrix, as developed here, require the introduction of some additional notation. Namely, let

$$ \begin{array}{@{}rcl@{}} \alpha_{gi} & = & \frac{p_{gi}}{{\sum}_{j=1}^{G}p_{ji}}, \end{array} $$

(7)

$$ \begin{array}{@{}rcl@{}} \mathbf{a}_{g} & = & \frac{1}{\pi_{g}}\mathbf{e}_{g},\mspace{30mu} g=1,\dots,G-1,\\ \mathbf{a}_{G} & = & -\frac{1}{\pi_{G}}\boldsymbol{1}_{G-1}, \end{array} $$

(8)

where e_g denotes the g th column of the identity matrix of order G − 1 and 1_G− 1 is the (G − 1) × 1 vector having each element equal to 1. The following quantities are also required:

$$ \begin{array}{@{}rcl@{}} \textbf{o}_{gi}&=&\boldsymbol{\Sigma}^{-1}_{\textbf{Y}_{g}}(\textbf{y}_{i}-\textbf{B}^{\prime}_{g}\textbf{x}^{*}_{i}), \end{array} $$

(9)

$$ \begin{array}{@{}rcl@{}} \textbf{O}_{gi}&=&\boldsymbol{\Sigma}^{-1}_{\textbf{Y}_{g}}-\textbf{o}_{gi}\textbf{o}^{\prime}_{gi}, \end{array} $$

(10)

$$ \begin{array}{@{}rcl@{}} \textbf{f}_{gi} & = & \boldsymbol{\Sigma}^{-1}_{\textbf{X}_{g}}(\textbf{x}_{i}-\boldsymbol{\mu}_{\textbf{X}_{g}}), \end{array} $$

(11)

$$ \begin{array}{@{}rcl@{}} \textbf{F}_{gi} & = & \boldsymbol{\Sigma}^{-1}_{\textbf{X}_{g}}-\textbf{f}_{gi}\textbf{f}^{\prime}_{gi}. \end{array} $$

(12)

The explicit forms of the score vector S(𝜗) and Hessian matrix H(𝜗) for a Gaussian linear cluster-weighted model are provided in Theorems 1 and 2 (see Appendix). Proofs can be found in the document with the supplementary materials.

3 Covariance Matrix Estimation of the ML Estimator

In the ML approach, the information matrix $\mathfrak {I}(\boldsymbol {\vartheta })$ plays a crucial role, as it is used to asymptotically estimate the covariance of the ML estimator of the model parameters. Under regularity conditions and if the model is correctly specified, $\mathfrak {I}(\boldsymbol {\vartheta })$ is given either by the covariance of the score function $\mathbb {E}\left (S(\boldsymbol {\vartheta }) S(\boldsymbol {\vartheta })^{\prime }\right )$ or the negative of the expected value of the Hessian matrix $-\mathbb {E}\left (H(\boldsymbol {\vartheta })\right )$. Clearly, an analytical evaluation of the expectations required to obtain $\mathfrak {I}(\boldsymbol {\vartheta })$ under model (2) is quite complex. By exploiting some asymptotic results concerning ML estimation (White, 1982), it is possible to obtain the following asymptotic estimators of $\mathfrak {I}(\boldsymbol {\vartheta })$:

$$ \mathcal{I}_{1}=\sum\limits_{i=1}^{I}\textbf{S}_{i}(\boldsymbol{\hat{\vartheta}})\textbf{S}_{i}(\boldsymbol{\hat{\vartheta}})^{\prime}, \ \mathcal{I}_{2}=-\sum\limits_{i=1}^{I}\textbf{H}_{i}(\boldsymbol{\hat{\vartheta}}), $$

where $\textbf {S}_{i}(\boldsymbol {\hat {\vartheta }})$ and $\textbf {H}_{i}(\boldsymbol {\hat {\vartheta }})$ denote the contribution of the i th sample observation to the score function and Hessian matrix evaluated at the ML estimate $\hat {\boldsymbol {\vartheta }}$, respectively. They can be used to obtain the following asymptotic estimators of $\text {Cov}(\hat {\boldsymbol {\vartheta }})$, the covariance matrix of $\hat {\boldsymbol {\vartheta }}$:

$$ \begin{array}{@{}rcl@{}} \widehat{\text{Cov}}_{1}(\hat{\boldsymbol{\vartheta}}) & = & \mathcal{I}_{1}^{-1}, \end{array} $$

(13)

$$ \begin{array}{@{}rcl@{}} \widehat{\text{Cov}}_{2}(\hat{\boldsymbol{\vartheta}}) & = & \mathcal{I}_{2}^{-1}. \end{array} $$

(14)

Under suitable conditions (see e.g. Newey & McFadden 1994; Ritter 2015, for a general discussion and some results specifically dealing with finite mixture models, respectively), $\widehat {\text {Cov}}_{1}(\hat {\boldsymbol {\vartheta }})$ and $\widehat {\text {Cov}}_{2}(\hat {\boldsymbol {\vartheta }})$ can be considered consistent estimators of $\text {Cov}(\hat {\boldsymbol {\vartheta }})$ when the model is correctly specified. By exploiting the so-called sandwich approach (see e.g. White 1980), the following robust estimator can also be employed:

$$ \widehat{\text{Cov}}_{3}(\hat{\boldsymbol{\vartheta}})=\mathcal{I}^{-1}_{2}\mathcal{I}_{1}\mathcal{I}^{-1}_{2}. $$

(15)

It is worth noting that for the existence of the estimators $\widehat {\text {Cov}}_{2}(\hat {\boldsymbol {\vartheta }})$ and $\widehat {\text {Cov}}_{3}(\hat {\boldsymbol {\vartheta }})$ it is required that matrix $\mathcal {I}_{2}$ is positive definite and well conditioned. The same requirements have to be fulfilled by matrix $\mathcal {I}_{1}$ in order to guarantee that $\widehat {\text {Cov}}_{1}(\hat {\boldsymbol {\vartheta }})$ exists. With large-scale covariance matrices and small sample sizes, $\mathcal {I}_{1}$ and/or $\mathcal {I}_{2}$ could be ill-conditioned. These situations can be managed by resorting to procedures able to produce improved estimators of $\mathfrak {I}(\boldsymbol {\vartheta })$ from either $\mathcal {I}_{1}$ or $\mathcal {I}_{2}$. For example, the algorithm by Higham (1988) computes the nearest positive definite matrix of a given symmetric matrix by adjusting its eigenvalues. Other approaches which exploit techniques of variance reduction could also be adopted (see e.g. Schäfer and Strimmer 2005).

4 Experimental Results from Simulated Datasets

4.1 Numerical Study of the Properties of the Proposed Estimators

In order to evaluate the properties of $\widehat {\text {Cov}}_{1}(\hat {\boldsymbol {\vartheta }})$, $\widehat {\text {Cov}}_{2}(\hat {\boldsymbol {\vartheta }})$ and $\widehat {\text {Cov}}_{3}(\hat {\boldsymbol {\vartheta }})$ in comparison with the estimators based on the parametric bootstrap and the approach implemented in flexCWM, five Monte Carlo studies have been performed. In the first study, the artificial datasets have been generated under a model defined by Eqs. 2–3 where G = 2, q = 1 and p = 2. As far as the model parameters are concerned, the following values have been employed: π₁ = 0.7, π₂ = 0.3, $\boldsymbol {\Sigma }_{\textbf {Y}_{1}}=1.5$, $\boldsymbol {\Sigma }_{\textbf {Y}_{2}}=1$, $\textbf {B}^{\prime }_{1}=(5,2,2)$, $\textbf {B}^{\prime }_{2}=(1,-2,-2)$, $\boldsymbol {\mu }^{\prime }_{\textbf {X}_{1}}=(-2,-2)$, $\boldsymbol {\mu }^{\prime }_{\textbf {X}_{2}}=(2,2)$, $\boldsymbol {\Sigma }_{\textbf {X}_{1}}=\left (\begin {smallmatrix} 1.0 & 0.2\\ 0.2 &1.0 \end {smallmatrix}\right )$, $\boldsymbol {\Sigma }_{\textbf {X}_{2}}=\left (\begin {smallmatrix} 1.0 & 0.4\\ 0.4 &1.0 \end {smallmatrix}\right )$. Such values have been chosen so as to produce two well-separated groups of observations (see the upper panel of Fig. 1, with the pairwise scatterplots of the variables X₁, X₂ and Y₁ for a sample of size I = 500 generated as just described). In this way, problems of label switching across simulations are less likely to occur. Furthermore, the ML estimates of 𝜗 are expected to be accurate enough to let the analysis be focused on the different ways of estimating the standard error of $\hat {\boldsymbol {\vartheta }}$. Using these parameter values, R = 10,000 datasets (of size I) have been generated as follows:

1.
For the r th dataset (r = 1, … , R), a sample of size I is drawn from the standard p-dimensional normal distribution; this gives the vectors 𝜖_1r, … , 𝜖_Ir;
2.
For the r th dataset (r = 1, … , R), a sample of size I is drawn from the standard q-dimensional normal distribution; this gives the vectors η_1r, … , η_Ir;
3.
For the r th dataset (r = 1, … , R), a sample of size I is drawn from the Bernoulli distribution with parameter π₁; this produces the 0-1 vector $\mathbf {z}_{r}=(z_{1r}, \ldots , z_{Ir})^{\prime }$;
4.
For the i th observation (i = 1, … , I) of the r th dataset, vectors x_ir and y_ir are obtained as follows:
$$ \begin{array}{@{}rcl@{}} \mathbf{x}_{ir} & = & \boldsymbol{\mu}_{\textbf{X}_{1}} +\textbf{A}_{\textbf{X}_{1}} \boldsymbol{\epsilon}_{ir}, \ \ \mathbf{y}_{ir} = \textbf{B}^{\prime}_{1} \mathbf{x}_{ir} + \textbf{A}_{\textbf{Y}_{1}} \boldsymbol{\eta}_{ir} \ \ \text{if \ } z_{ir}=1,\\ \mathbf{x}_{ir} & = & \boldsymbol{\mu}_{\textbf{X}_{2}} +\textbf{A}_{\textbf{X}_{2}} \boldsymbol{\epsilon}_{ir}, \ \ \mathbf{y}_{ir} = \textbf{B}^{\prime}_{2} \mathbf{x}_{ir} + \textbf{A}_{\textbf{Y}_{2}} \boldsymbol{\eta}_{ir} \ \ \text{if } \ z_{ir}=0, \end{array} $$
where $\textbf {A}_{\textbf {X}_{g}}$ and $\textbf {A}_{\textbf {Y}_{g}}$ are matrices obtained from the spectral decompositions of $\boldsymbol {\Sigma }_{\textbf {X}_{g}}$ and $\boldsymbol {\Sigma }_{\textbf {Y}_{g}}$, respectively. Such matrices are constructed such that $\textbf {A}_{\textbf {X}_{g}} \textbf {A}^{\prime }_{\textbf {X}_{g}}= \boldsymbol {\Sigma }_{\textbf {X}_{g}}$, $\textbf {A}_{\textbf {Y}_{g}} \textbf {A}^{\prime }_{\textbf {Y}_{g}}= \boldsymbol {\Sigma }_{\textbf {Y}_{g}}$, for g = 1,2.

In the second study, the datasets have been obtained through the same procedure used in the first one except from the computation of vectors 𝜖_ir and η_ir. Namely, a sample of size I ⋅ p is drawn from the uniform distribution in the interval (0,1) for the r th dataset (r = 1, … , R); this produces a vector $\boldsymbol {\epsilon }^{*}_{r}$, whose elements are transformed as follows: $\epsilon _{jr}=\sqrt {12}(\epsilon _{jr}^{*}-0.5), \ j=1, \ldots , I \cdot p;$ the vector 𝜖_r resulting from this transformation has zero mean and unit variance; partitioning 𝜖_r into Ip −dimensional vectors leads to 𝜖_1r, … , 𝜖_Ir. The same process has been applied to obtain vectors η_1r, … , η_Ir. The second panel of Fig. 1 provides the pairwise scatterplots of X₁, X₂ and Y₁ for a sample of size I = 500 from this second study.

In the third and fourth studies, the datasets have been generated as in the first and second studies, respectively, but using the following model parameters: $\boldsymbol {\mu }_{\textbf {X}_{1}} = (-1,-1)^{\prime }$, $\boldsymbol {\mu }_{\textbf {X}_{2}} = (1,1)^{\prime }$. This change in the values of $\boldsymbol {\mu }_{\textbf {X}_{1}}$ and $\boldsymbol {\mu }_{\textbf {X}_{2}}$ leads to overlapping groups of observations (see the pairwise scatterplots of X₁, X₂ and Y₁ for samples of size I = 500 in third and fourth panels of Fig. 1). The total number of model parameters in the first four studies is 19.

The fifth study has been carried out with the same settings of the first study but with p = 8 explanatory variables. The model parameters pertaining to X which have been employed to generate the datasets are as follows: $\boldsymbol {\mu }_{\textbf {X}_{1}} = -2 \cdot \textbf {1}_{8}$, $\boldsymbol {\mu }_{\textbf {X}_{2}} = 2 \cdot \textbf {1}_{8}$, V (X_j|Ω_g) = 1 ∀j for g = 1,2, $\text {Cov}(X_{j},X_{h}|\mathbf {\Omega }_{1})=1 - \frac {|j-h|}{8} \ \forall j \neq h$, $\text {Cov}(X_{j},X_{h}|\mathbf {\Omega }_{2})=1 - \frac {|j-h|}{4} \ \forall j \neq h$. As far as the effects of the regressors on Y₁ are concerned, they have been set as follows: $\textbf {B}_{1}=(5,\boldsymbol {\mu }^{\prime }_{\textbf {X}_{2}})^{\prime }$, $\textbf {B}_{2}=(1,\boldsymbol {\mu }^{\prime }_{\textbf {X}_{1}})^{\prime }$. In this latter study, the total number of model parameters is 109.

In all studies, Monte Carlo experiments have been performed with two different sample sizes: I = 250,500 in the first four studies, I = 300,500 in the last study. In all experiments, it has been assumed that the r th dataset {(x_1r,y_1r), … , (x_Ir,y_Ir)} is generated from a model defined by Eqs. 2–3 with G = 2. Thus, the maximum likelihood estimate $\hat {\boldsymbol {\vartheta }}_{r}$ of 𝜗 has been computed for r = 1, … , R under such an assumption. Parameter estimation has been carried out through the EM algorithm as implemented in the function cwm of the package flexCWM. As far as the initialisation of the parameters is concerned, an option has been employed, which executes 5 independent runs of the k-means algorithm and picks the solution maximising the observed-data log-likelihood among these runs. The maximum number of iterations of the EM algorithm has been set equal to 400. A convergence criterion based on the Aitken acceleration has been used, with a threshold 𝜖 = 10^− 6 (for further details, see Mazza et al., 2018).

The R independent estimates of 𝜗 are used to approximate the true distribution of $\hat {\boldsymbol {\vartheta }}$ and, in particular, the true standard errors of all the elements of $\hat {\boldsymbol {\vartheta }}$. Estimates of these standard errors have been computed using the three information-based estimators and the parametric bootstrap for R₁ = 2000 datasets obtained as described above. For each bootstrap estimate, 100 bootstrap samples have been employed. For the ML estimates of the model intercepts and regression coefficients, the standard errors estimated by the function cwm of the package flexCWM using the approach illustrated in the introduction have been included in the comparison. The performances of these strategies have been evaluated on the basis of an estimate of their biases and root mean squared errors (RMSE). A comparative evaluation of such approaches has been carried out also through the coverage probabilities (CP) of 90% and 95% confidence intervals based on the examined standard errors’ estimates and the standard normal quantiles. In this latter comparison, the attention is focused on the expected mean values of the regressors (i.e. $\boldsymbol {\mu }_{\textbf {X}_{1}}$ and $\boldsymbol {\mu }_{\textbf {X}_{2}}$) and the regression coefficients (all the entries in the first column of B₁ and B₂ except the first one).

Tables 1, 2, 3 and 4 contain the biases and RMSEs for the first four Monte Carlo studies with samples of size 250. The same information for the last study and the sample size I = 300 can be found in Table 5. The corresponding values for the CPs are summarised in Tables 6, 7, 8, 9 and 10. In all the tables, the results obtained using the function cwm of the package flexCWM, the bootstrap and the estimators defined in Eqs. 13–15 are denoted as cwm, Boot, C₁, C₂ and C₃, respectively. From now on, B_g[j,k] is used to denote the element on the j-th row and k-th column of matrix B_g; μ_g[j] represents the j-th element of μ_g. Due to the large number of parameters pertaining to the regressors in the fifth study, Table 5 provides the mean values of the biases and RMSEs of the ML estimates over the elements of each of the following vectors of model parameters: $\boldsymbol {\mu }_{\textbf {X}_{g}}$, $\mathrm {v}(\boldsymbol {\Sigma }_{\textbf {X}_{g}})$, B_g[− 1,1] for g = 1,2, where B_g[− 1,1] is the vector obtained by dropping the first element from the first column of B_g. Thus, B_g[− 1,1] comprises the regression coefficients of the p covariates on Y₁ given Ω_g. In a similar way, Table 10 contains the mean values of the CPs of 90% and 95% confidence intervals over $\boldsymbol {\mu }_{\textbf {X}_{g}}$ and B_g[− 1,1] for g = 1, 2.

Table 1 Biases and root mean squared errors of the examined standard error estimators of $\hat {\boldsymbol {\vartheta }}$ in the first study, I = 250

Estimating the Covariance Matrix of the Maximum Likelihood Estimator Under Linear Cluster-Weighted Models

Abstract

Similar content being viewed by others

Empirical Likelihood-Based Inference for the Difference of Two Location Parameters Using Smoothed M-Estimators

A Comparative Study of Robust and Stable Estimates of Multivariate Location

Covariance matrix estimation of the maximum likelihood estimator in multivariate clusterwise linear regression

1 Introduction

2 Score Vector and Hessian Matrix of Gaussian Linear Cluster-Weighted Models

3 Covariance Matrix Estimation of the ML Estimator

4 Experimental Results from Simulated Datasets

4.1 Numerical Study of the Properties of the Proposed Estimators

4.2 A Comparison with Some Estimators Under Normal Mixtures

5 Analysing Regional Tourism Data in Italy

6 Conclusions

Notes

References

Funding

Author information

Authors and Affiliations

Corresponding author

Additional information

Publisher’s Note

Appendix

Appendix

Theorem 1

Theorem 2

Rights and permissions

About this article

Cite this article

Share this article

Keywords

Search

Navigation