1 Introduction

It is not uncommon for a set of variables to be so strongly correlated that they can be regarded as intrinsically one-dimensional, that is, as generated from a latent one-dimensional linear subspace plus noise. Examples of such situations include price indexes for several goods, educational attainment scores on several different abilities, or several psychological mental health indicators. While a rich set of statistical tools exists for identifying best linear approximations of multivariate data, usually based on algebraic properties of the sample covariance matrix (such as principal component analysis), this paper follows a different approach which is firmly rooted in basic principles of statistical modelling, and hence allows versatile access to routine statistical tasks such as clustering or regression.

The basic idea, which is of more general validity than the framework focused on in this paper, is to consider the approximating lower-dimensional subspace as a latent variable in a multivariate statistical model and to model this latent variable by a random effect. In this work, we develop and implement this very general idea in a more specific framework, where we assume the low-dimensional structure to be a one-dimensional space, i.e. a straight line.

More specifically, we consider a scenario where the multivariate data \({x}_{i}\in {\mathbb {R}}^m\) are noisy observations scattered along the one-dimensional space \(\alpha + \beta {z}\), where \(\alpha ,\beta \in {\mathbb {R}}^m\), and \({z}\in {\mathbb {R}}\) is an unobserved instance of a (latent) variable Z. Then, the observed data \(x_i=(x_{i1}, \ldots , x_{im})^T, i=1, \ldots , n\), are assumed to be generated from the following random effect model,

$$\begin{aligned} x_i = \alpha + \beta z_i + \varepsilon _{i}, \end{aligned}$$
(1)

where \(\varepsilon _{i} \sim {N} (0, \Sigma _i) \) is m-variate Gaussian noise with a positive definite variance matrix \(\Sigma _i\equiv \Sigma (z_i) \in {\mathbb {R}}^{m \times m}\). It is clear that a model with observation-specific \(m \times m\) variance matrices is heavily overparametrized, and we will never contemplate fitting this model in full generality. We still provide this general notation in Eq. (1) as it contains all practically relevant special cases that will arise naturally, including, of course, the homoscedastic case \(\Sigma _i=\Sigma \), \(i=1, \ldots ,n\).

For the estimation of the random effect distribution along this line, we use the nonparametric maximum likelihood approach, which amounts to representing this distribution by a set of discrete mass points (mixture centres) with corresponding masses (mixture probabilities). While this may look like a restrictive assumption, it is actually more flexible than the application of a Gaussian random effect, as it allows for multimodality in the distribution of the latent variable. Indeed, the mixture character of this approach allows for clustering of observations based on the fitted model.

In consequence, we arrive at a modelling approach of enormous versatility. Firstly, as just expressed, observations can be clustered based on maximum a posteriori (MAP) probabilities of class membership [14, chapter 11]. Secondly, by projecting the original data points onto the estimated lower-dimensional space, the dimension of the original multivariate data is reduced (to 1, in the simple framework discussed in this work), and the compressed data can be used as a summary statistic (such as an overall price index across several goods) or for further inferential purposes. Thirdly, the relative order of the posterior random effects (observations ‘projected’ onto the latent linear subspace) can be used for ranking observations in multivariate data sets. Finally, we will show that it is not difficult to include additional covariates into model (1), so that one has de facto a novel tool for multivariate response situations, yielding reduced parameter standard errors as compared to separate univariate response models. We will give each of these important applications some prominence later in the paper.

To enable some intuition for how the model operates (in the simpler case without covariates), we use the faithful data set in R package MASS [20]. This is a two-dimensional data set with 272 observations on two variables: eruption time and the waiting time between two eruptions. The straight line in Fig. 1 is the one-dimensional latent space \(\alpha +\beta z\) that is parameterized by the latent variable. The red triangles positioned along the straight line are the estimated mixture (cluster) centres. As a metaphor, one could consider the mixture centres as ‘washing pegs’ spanning a ‘washing rope’ holding the clusters. We will return to this example in Sect. 5.1 and illustrate there in detail how exactly this image translates into projections (dimension reduction) and clusterings.

Some methodologically related techniques have been suggested previously in the literature, some of them long ago. In the homoscedastic case, the model (1) can be seen as a one-dimensional factor analysis model (see [14, chapter 12]), with the difference that we will apply a discrete mixture approximation of the latent variable \(z_i\). There is also some overlap with the generative topographic mapping (GTM, [7]), which allows for nonlinear manifolds rather than just a latent straight line. However, in the GTM, the latent variables are parameterized by a fixed and equidistant grid, rather than estimable masses and mass points, rendering the approach less useful for clustering-type applications. Under both the factor analysis and the GTM approaches, there is no immediate possibility to include covariates, and hence, they do not serve as a multivariate response model. Models of the type (1) have also been proposed in the literature on model-based clustering in high-dimensional data scenarios, an overview of which is given in [8]. Sammel et al. [17] proposed a latent variable model for mixed discrete and continuous outcomes from the exponential family, where, however, the latent variable itself is modelled by covariates, contrasting with the approach investigated here.

Fig. 1

Graph showing the estimated one-dimensional space with cluster centres in red and \(\alpha \) in green

The remainder of this work is organized as follows. In Sect. 2, we give details of the nonparametric maximum likelihood procedure to estimate the parameters of model (1), yielding an EM algorithm which also automatically estimates masses, mass points, and posterior probabilities of data points being associated with those. Simulation studies which illustrate the accuracy of the proposed estimation methodology are presented in Sect. 3. This is followed by Sect. 4, where we will lay down the clustering and projection operations explicitly. Furthermore, in Sects. 4.4 and 4.5, we consider extension of the proposed framework allowing for covariates along with a bootstrap approach for the computation of standard errors. Applications to several real data scenarios are given in Sect. 5, which we also use to illustrate the main application pillars of clustering, dimension reduction, ranking, and regression, explicitly. The paper is concluded with a discussion in Sect. 6. Some technical derivations are relegated to an ‘Appendix’.

2 Methods and Estimation

2.1 Likelihood

The marginal probability density function \({f}({x}_{i}\vert \alpha , \beta )\) for observations generated from model (1) can be written as

$$\begin{aligned} {f}({x}_{i} \vert \alpha , \beta ) = \int {f}({x}_{i}, {z}_{i} \vert \alpha , \beta )\textrm{d}{z}_{i} = \int {f}({x}_{i} \vert {z}_{i}, \alpha , \beta ) \phi ({z}_{i}) \textrm{d}{z}_{i}, \end{aligned}$$
(2)

where \(f({x}_{i},{z}_{i} \vert \alpha , \beta )\) is the joint probability distribution of observed data \({x}_{i}\) and unobserved random effects \({z}_{i}\), and \(\phi (\cdot )\) is the density function of the random effect distribution Z. This model is not fully specified since it lacks specific parametrizations of the (unknown) \(\Sigma _i=\Sigma (z_i)=\text{ Var }(x_i|z_i, \alpha , \beta )\) and \(\phi \), but let us consider any (additional) parameters involved in these initially as nuisance parameters and construct appropriate parametrizations for them as we go along.

The initial goal is to find maximum likelihood estimates for the parameters \(\alpha \) and \(\beta \) in model (1). Building on the marginal density (2), the likelihood of model (1) is the following,

$$\begin{aligned} L(\alpha , \beta ) = \prod _{i=1}^{n} \int {f}({x}_{i} \vert {z}_{i}, \alpha , \beta ) \phi ({z}_{i}) \textrm{d}{z}_{i} \end{aligned}$$

with corresponding log-likelihood,

$$\begin{aligned} l(\alpha , \beta ) = \sum _{i=1}^{n}\log \left\{ \int {f}({x}_{i} \vert {z}_{i}, \alpha , \beta ) \phi ({z}_{i}) \textrm{d}{z}_{i} \right\} . \end{aligned}$$
(3)

At this stage, a decision needs to be made on how to deal with the integral figuring in Eq. (3). In principle, one could do this based on a Gaussianity assumption on \(\phi (\cdot )\), as is common in the mixed model context, in this case leading us back to a factor analysis framework. However, for reasons expressed in the introduction, we decide differently here and employ instead Aitkin’s nonparametric maximum likelihood approach [2]. Here, the random effect distribution Z is approximated by a discrete mixture distribution, say \({\tilde{Z}}\), which is supported on a finite number of mass points \(z_1, \ldots , z_K\) with masses \(P({\tilde{Z}}=z_k)=\pi _k\), \(k=1, \ldots ,K\). This discrete mixture facilitates a simple approximation of the marginal likelihood which involves sums rather than integrals, i.e.

$$\begin{aligned} l(\alpha , \beta ) \approx \sum _{i=1}^{n}\log \left\{ \sum _{k=1}^K {f}({x}_{i}\vert {z}_{k}, \alpha , \beta ) \pi _k \right\} . \end{aligned}$$
(4)

Laird [12] showed that the marginal likelihood (3) can be approximated arbitrarily well by (4) with a finite set of mass points. We see that this has now become a mixture-type problem, with each mixture component k representing a latent ‘class’ within the domain of Z (we will use the terms class and component interchangeably henceforth). The EM algorithm [9] is one of the most widely used algorithms for the estimation of parameters in mixture models.
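To make the structure of (4) concrete, the following R sketch evaluates the approximated marginal log-likelihood for given parameter values. It is a minimal illustration under a homoscedastic variance specification, using dmvnorm from the mvtnorm package; the function and argument names are our own and not part of any published implementation.

```r
# Approximated marginal log-likelihood (4); illustrative sketch only.
# x:     n x m data matrix
# alpha: m-vector; beta: m-vector
# z:     K-vector of mass points; pi_k: K-vector of masses
# Sigma: m x m variance matrix (homoscedastic case)
library(mvtnorm)

approx_loglik <- function(x, alpha, beta, z, pi_k, Sigma) {
  K <- length(z)
  # f[i, k] = f(x_i | z_k, alpha, beta), a multivariate Gaussian density
  f <- sapply(seq_len(K), function(k)
    dmvnorm(x, mean = alpha + beta * z[k], sigma = Sigma))
  sum(log(f %*% pi_k))   # sum over k inside the log, over i outside
}
```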

Denote by \({f}_{ik} = {P} ( {x}_{i}\vert {\tilde{Z}} = {z_k}) = {f}({x}_{i}\vert {z}_{k}, \alpha , \beta )\) the probability density of \(x_i\) conditional on class k. Then, we know that

$$\begin{aligned} {P}({x}_{i}, {\tilde{Z}}=z_{k}) = {P}({x}_{i}\vert {\tilde{Z}}=z_{k}){P}({\tilde{Z}}=z_{k}) = {f}_{ik} {\pi }_{k}. \end{aligned}$$

Since it is in practice unknown which component each observation belongs to, this is an incomplete data scenario. We describe the missing information on the component membership by an indicator variable

$$\begin{aligned} {G}_{ik}= {\left\{ \begin{array}{ll} 1,&{} \text {if observation } \,i\, \text {belongs to component}\, k\\ 0, &{} \text {otherwise}. \end{array}\right. } \end{aligned}$$

This defines complete data \(({x}_{i}, {G}_{i1},\ldots ,{G}_{iK})\), \({i} = 1,\ldots ,n\), with probability

$$\begin{aligned} {P}({x}_{i}, {G}_{i1},\ldots ,{G}_{iK}) = \prod _{k=1}^{K} ({f}_{ik} {\pi }_{k})^{G_{ik}} \end{aligned}$$

and resulting complete data likelihood \( \prod _{i=1}^{n}\prod _{k=1}^{K} ({f}_{ik} {\pi }_{k})^{G_{ik}}\, \). Then, we can obtain the expected complete log-likelihood

$$\begin{aligned} \begin{aligned} {l}_c&= \sum _{i=1}^{n} {\mathbb {E}}\left[ \log \left( \prod _{k=1}^{K} ( {\pi }_{k} f_{ik})^{{G}_{ik}} \right) |x_i \right] \\&= \sum _{i=1}^{n} \sum _{k=1}^{K} {\mathbb {E}}\left[ {G}_{ik}\vert {x}_{i} \right] \log \left( {\pi }_{k} {f} _{ik}\right) \\&= \sum _{i=1}^{n} \sum _{k=1}^{K} {w}_{ik} \log {\pi }_{k} + \sum _{i=1}^{n} \sum _{k=1}^{K} {w}_{ik} \log {f}_{ik} \end{aligned} \end{aligned}$$
(5)

where \({w}_{ik} = {\mathbb {E}}\left[ {G}_{ik}\vert {x}_{i}\right] = {P}({G}_{ik} = {1}\vert {x}_{i}) = {P}({\tilde{Z}} = z_{k}\vert {x}_{i})\), which is the probability of each observation i belonging to component k. For the component-specific densities \(f_{ik}\), we specify, conditional on the mixture centres \(z_k\), a multivariate Gaussian model

$$\begin{aligned} f_{ik} = \frac{1}{(2\pi )^{m/2}}\frac{1}{|\Sigma _k|^{1/2}} \exp \left( -\frac{1}{2}(x_i - \alpha - \beta z_k)^{T}\Sigma _k^{-1} (x_i - \alpha - \beta z_k) \right) \end{aligned}$$
(6)

where we allow the variance matrices \(\Sigma _k =\Sigma (z_k)\) to depend on the cluster k but not on observation i, hence reducing the complexity of the original, fully heteroscedastic, variance specification considerably. The terms \(\alpha + \beta {z}_{k}\) can be interpreted as the mixture centres in the original data space, spanned along the line \(\alpha + \beta z\). Note that the right hand side of (4) is then the likelihood corresponding to the ‘approximative’ model

$$\begin{aligned} x_i|z_k, \alpha , \beta \sim N(\alpha + \beta z_k, \Sigma _k) \, \text{ with } \text{ probability } \, \pi _k, \end{aligned}$$
(7)

where we treat the mass points \(z_k\), \(k=1, \ldots , K\), and their associated masses \(\pi _k\) as unknown parameters to be estimated in the EM algorithm alongside with the model parameters \(\alpha \) and \(\beta \). This model can be seen as a Gaussian mixture model with structured mean function and component-specific variances, or as a multivariate response version of the ‘nonparametric maximum likelihood’ (NPML) approach for the estimation of mixture masses and mass points in random effect models [2, 4, chapter 8].

Several reduced, parsimonious, parameterizations of the variance matrices \(\Sigma _k\) are possible in order to describe the shape of the clusters around the mixture centres. The simplest case (i) would be a constant and diagonal matrix \(\Sigma _k \equiv \Sigma =\text{ diag }(\sigma _{j}^{2})_{\left\{ 1\le {j}\le {m} \right\} }\in {\mathbb {R}}^{m\times m}\), which gives the same variance specification to all K components of the mixture. Second (ii), we consider using different diagonal variance matrices for different components, \(\Sigma _k=\text{ diag }(\sigma _{jk}^{2})_{\left\{ 1\le {j}\le {m}\right\} } \in {\mathbb {R}}^{m\times m}\), which yields an improvement when the data have clusters of different sizes. Third (iii), we consider using the same full (unrestricted) variance matrix, \(\Sigma _k \equiv \Sigma \in {\mathbb {R}}^{m\times m}\), to capture the correlation of variables. Finally (iv), different full (unrestricted) variance matrices, \(\Sigma _k\in {\mathbb {R}}^{m\times m}\), give better estimates when dealing with clusters that differ in shape and size. In line with (6) and (7), our notation in what follows will be tailored to this most general case (iv), with the results for the reduced parameterizations deriving naturally from it.

2.2 EM Algorithm

Now we can set up the EM algorithm for estimating model (7). It is noted that the developments in this subsection are for a fixed number of components, K. The question of choosing K is considered as a model selection problem and will be addressed through the use of model selection criteria as illustrated in later sections.

E-step

Using the Bayes’ theorem, we obtain the posterior probability of observation i belonging to component k,

$$\begin{aligned} {w}_{ik} = {P}({\tilde{Z}} = z_{k}\vert {x}_{i}) = \frac{{P}({\tilde{Z}} = z_{k}) {P}({x}_{i}\vert {\tilde{Z}} = z_{k})}{\sum _{l}^{} {P}({\tilde{Z}} = z_{l}) {P}({x}_{i}\vert {\tilde{Z}} = z_{l})} = \frac{{\pi }_{k} {f}_{ik}}{\sum _{l}{\pi }_{l} {f}_{il} }. \end{aligned}$$
(8)
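In code, the E-step (8) amounts to evaluating the K component densities and normalizing row-wise. A minimal R sketch, assuming a list Sigma of K variance matrices (parameterization (iv)); all names are illustrative:

```r
# E-step (8): posterior probabilities w_ik; illustrative sketch.
# Sigma is here a list of K variance matrices (parameterization (iv)).
e_step <- function(x, alpha, beta, z, pi_k, Sigma) {
  K <- length(z)
  f <- sapply(seq_len(K), function(k)
    mvtnorm::dmvnorm(x, mean = alpha + beta * z[k], sigma = Sigma[[k]]))
  num <- sweep(f, 2, pi_k, `*`)   # pi_k * f_ik
  num / rowSums(num)              # normalize each row to sum to 1
}
```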

M-step

Using expression (6) for the component-wise densities \(f_{ik}\), the expected complete data log-likelihood becomes

$$\begin{aligned} \begin{aligned} {l_c}&= \sum _{i=1}^{n}\sum _{k=1}^{K} {w}_{ik}\log (\pi _{k}) + \sum _{i=1}^{n}\sum _{k=1}^{K} -\frac{1}{2}{w}_{ik}\log (|\Sigma _k|) + \sum _{i=1}^{n}\sum _{k=1}^{K}-\frac{m}{2}\log (2\pi ){w}_{ik} \\&\quad + \sum _{i=1}^{n}\sum _{k=1}^{K}-\frac{1}{2}{w}_{ik}(x_i - \alpha - \beta z_k)^{T}\Sigma _k^{-1} (x_i - \alpha - \beta z_k). \end{aligned} \end{aligned}$$
(9)

Taking partial derivatives of \(l_{c}\) with respect to each parameter gives the score equations. We then obtain the following estimators for the parameters \(\alpha \), \(\beta \), \({z}_{k}\) and \(\pi _{k}\) by setting these score equations to zero and solving them:

$$\begin{aligned}{} & {} {\hat{\alpha }} = \left( \sum _{i=1}^n\sum _{k=1}^Kw_{ik}{\hat{\Sigma }}_k^{-1}\right) ^{-1}\left( \sum _{i=1}^n\sum _{k=1}^Kw_{ik}{\hat{\Sigma }}_k^{-1}(x_i- {\hat{\beta }} {\hat{z}}_k)\right) \end{aligned}$$
(10)
$$\begin{aligned}{} & {} {\hat{\beta }} = \left( \sum _{i=1}^n\sum _{k=1}^Kw_{ik}{\hat{\Sigma }}_k^{-1}{\hat{z}}_k^2\right) ^{-1} \left( \sum _{i=1}^n\sum _{k=1}^Kw_{ik}{\hat{\Sigma }}_k^{-1}(x_i-{\hat{\alpha }}){\hat{z}}_k\right) \end{aligned}$$
(11)
$$\begin{aligned}{} & {} {\hat{z}}_{k} = \frac{\sum _{i=1}^{n} {w}_{ik} {\hat{\beta }}^{T} {\hat{\Sigma }}_{k}^{-1} ({x}_{i} - {\hat{\alpha }})}{\sum _{i=1}^{n} {w}_{ik} {\hat{\beta }}^{T} {\hat{\Sigma }}_{k}^{-1}{\hat{\beta }}}. \end{aligned}$$
(12)

For the mixture probabilities, since \(\sum _{k=1}^{K} \pi _{k} = 1\), we need to apply a Lagrange multiplier by letting \(\partial \left( {l}_c - \lambda (\sum _{k=1}^{K}\pi _{k} - {1})\right) / \partial \pi _{k} = 0\), yielding

$$\begin{aligned} {\hat{\pi }}_{k} = \frac{1}{n}\sum _{i=1}^{n}{w}_{ik}. \end{aligned}$$
(13)

Estimators for the flexible variance specifications are again obtained by equating the corresponding partial derivatives to zero, giving results as follows:

  1. (i)

    \(\Sigma \) = \(\text{ diag }(\sigma _{j}^{2})_{\left\{ 1\le {j}\le {m} \right\} } \in {\mathbb {R}}^{m\times m}\), \({k} = 1,\ldots ,K,\)

    $$\begin{aligned} {\hat{\sigma }}_{j}^{2} = \frac{1}{n}\sum _{i=1}^{n} \sum _{k=1}^{K} {w}_{ik} ({x}_{ij} - {\hat{\alpha }}_{j} - {\hat{\beta }}_{j}{\hat{z}}_{k})^{2}; \end{aligned}$$
    (14)
  2. (ii)

    \(\Sigma _k\) = \(\text{ diag }(\sigma _{jk}^{2})_{\left\{ 1\le {j}\le {m} \right\} } \in {\mathbb {R}}^{m\times m}\), \({k} = 1,\ldots ,K,\)

    $$\begin{aligned} {\hat{\sigma }}_{jk}^{2} = \frac{\sum _{i=1}^{n} w_{ik}(x_{ij} - {\hat{\alpha }}_{j} - {\hat{\beta }}_{j} {\hat{z}}_{k})^{2}}{\sum _{i=1}^{n} w_{ik}}; \end{aligned}$$
    (15)
  3. (iii)

    \(\Sigma = \Sigma _1 =\cdots = \Sigma _K\in {\mathbb {R}}^{m\times m}\), \({k} = 1,\ldots ,K\),

    $$\begin{aligned} {\hat{\Sigma }} = \frac{1}{n} \sum _{i=1}^{n} \sum _{k=1}^{K} {w}_{ik} (x_{i} - {\hat{\alpha }} - {\hat{\beta }} {\hat{z}}_{k})(x_{i} - {\hat{\alpha }} - {\hat{\beta }} {\hat{z}}_{k})^{T}; \end{aligned}$$
    (16)
  4. (iv)

    \(\Sigma _k\in {\mathbb {R}}^{m\times m}, {k} = 1,\ldots ,K,\)

    $$\begin{aligned} {\hat{\Sigma }}_{k} = \frac{\sum _{i=1}^{n} w_{ik}(x_{i} - {\hat{\alpha }} - {\hat{\beta }} {\hat{z}}_{k})(x_{i} - {\hat{\alpha }} - {\hat{\beta }} {\hat{z}}_{k})^{T} }{\sum _{i=1}^{n} w_{ik}}. \end{aligned}$$
    (17)

It is evident that all of these estimators depend on the weights \(w_{ik}\), hence requiring the use of the EM algorithm which iterates between finding the above estimates and updating the weights given the estimates.
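As an illustration of how these closed-form updates translate into code, the following R sketch implements (13) and the variance update (17) for parameterization (iv); for (ii), one would retain only the diagonal of each matrix. The names are again our own illustration, not the actual implementation.

```r
# M-step updates for the masses (13) and variance matrices (17); sketch only.
# w: n x K matrix of posterior weights from the E-step.
m_step_pi <- function(w) colMeans(w)   # Eq. (13)

m_step_Sigma <- function(x, w, alpha, beta, z) {
  lapply(seq_along(z), function(k) {
    R <- sweep(x, 2, alpha + beta * z[k])    # residuals x_i - alpha - beta z_k
    crossprod(R * w[, k], R) / sum(w[, k])   # weighted scatter matrix, Eq. (17)
  })
}
```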

2.3 Computational Considerations

It is noted from Eqs. (10), (11), and (12) that these involve many inversions of the estimated variance matrices \({\hat{\Sigma }}_k\). This can make the EM algorithm computationally unstable, especially under the component-specific variance parameterizations (ii) and (iv). Therefore, in our implementation of the above EM algorithm, we disentangle the M-step updates of \({\hat{\Sigma }}_k\) from those of \({\hat{\alpha }}\), \({\hat{\beta }}\) and \({\hat{z}}_k\). Specifically, the updates (10), (11), and (12) are executed in a simplified form where \({\hat{\Sigma }}_k \equiv \text{ diag }(\sigma ^2)\), for some constant \(\sigma ^2\) which does not need to be specified since it cancels out from the resulting simplified update equations, yielding

$$\begin{aligned}{} & {} {\hat{\alpha }} = \frac{1}{n}\left( \sum _{i=1}^{n}{x}_{i} - {\hat{\beta }}\sum _{i=1}^{n}\sum _{k=1}^{K}{w}_{ik}\hat{{z}}_{k}\right) ; \end{aligned}$$
(18)
$$\begin{aligned}{} & {} {\hat{\beta }} = \frac{\sum _{i=1}^{n}\sum _{k=1}^{K}{w}_{ik}\hat{{z}}_{k}{x}_{i} - \frac{1}{n}(\sum _{i=1}^{n}{x}_{i})(\sum _{i=1}^{n}\sum _{k=1}^{K}{w}_{ik}\hat{{z}}_{k})}{\sum _{i=1}^{n}\sum _{k=1}^{K}{w}_{ik}\hat{{z}}_{k}^{2} -\frac{1}{n}(\sum _{i=1}^{n}\sum _{k=1}^{K}{w}_{ik}\hat{{z}}_{k})^{2} }; \end{aligned}$$
(19)
$$\begin{aligned}{} & {} {\hat{z}}_{k} = \frac{{\hat{\beta }}^{T}\sum _{i=1}^{n} {w}_{ik} ({x}_{i} - {\hat{\alpha }})}{ {\hat{\beta }}^{T} {\hat{\beta }}\sum _{i=1}^{n} {w}_{ik}}. \end{aligned}$$
(20)

That is, in our implementation, within each M-step, we cycle a small number of times (five is typically sufficient) between (18), (19), and (20), then we update \({\hat{\pi }}_k\) via (13), followed by the respective update of the variance matrices according to any of (14), (15), (16), or (17), depending on the variance parameterization. The resulting updated parameters are then used in the upcoming E-step (8). The simulation studies in Sect. 3 will confirm that this approach yields accurate parameter estimates.
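A sketch of this simplified inner cycle, with the updates (18)–(20) written out in R (variable names illustrative); in the actual algorithm, the updated \({\hat{z}}_k\) would subsequently be re-standardized as described in Sect. 2.4:

```r
# Simplified inner M-step cycle, Eqs. (18)-(20); sketch with illustrative names.
update_line <- function(x, w, z, alpha, beta, n_cycles = 5) {
  n <- nrow(x)
  for (it in seq_len(n_cycles)) {
    zbar  <- as.vector(w %*% z)                      # sum_k w_ik z_k for each i
    alpha <- colMeans(x) - beta * mean(zbar)                         # (18)
    num   <- colSums(x * zbar) - colSums(x) * mean(zbar)
    den   <- sum(as.vector(w %*% z^2)) - n * mean(zbar)^2
    beta  <- num / den                                               # (19)
    z     <- sapply(seq_along(z), function(k)                        # (20)
      sum(w[, k] * (sweep(x, 2, alpha) %*% beta)) / (sum(beta^2) * sum(w[, k])))
  }
  list(alpha = alpha, beta = beta, z = z)
}
```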

2.4 Identifiability

Consider again the model for the \(x_i\) implied by equation (7), i.e.

$$\begin{aligned} x_i = \alpha + \beta z_k + \varepsilon _{i}. \end{aligned}$$
(21)

The product term \(\beta {z}_{k}\) makes the parameters \(\beta =(\beta _1, \ldots , \beta _m)^T\) and \({z}_{k}\) unidentifiable. The vector \(\alpha \) is also unidentifiable as, when moving along the estimated straight line, the same model could be attained by translating all \({z}_{k}\)’s along the line. Therefore, the model is identifiable only under certain restrictions, and in order to fix the problem, we standardize \({z}_{k}\) by letting

$$\begin{aligned} \text{ E }({\tilde{Z}}) = \sum _{k=1}^{K} \pi _{k} {z}_{k} = 0 \end{aligned}$$
(22)

and

$$\begin{aligned} \text{ Var }({\tilde{Z}}) = \sum _{k=1}^{K} \pi _{k} {z}_{k}^{2} - \left( \sum _{k=1}^{K} \pi _{k} {z}_{k}\right) ^{2} = 1. \end{aligned}$$
(23)

Equation (22) solves the problem for \(\alpha \) by fixing the position of \({z}_{k}\)’s along the estimated lower-dimensional subspace, and Eq. (23) solves the scale problem for \(\beta \). Additionally, to identify the direction of the latent variable, we enforce \(\beta _1\ge 0\) (but any other component of \(\beta \) could equally be chosen for this).
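These constraints can be imposed after each update of the mass points without changing the fitted line, by absorbing the shift and scale into \(\alpha \) and \(\beta \); a minimal R sketch (names illustrative):

```r
# Enforce E(Z) = 0, Var(Z) = 1 and beta_1 >= 0 without changing the fitted line; sketch.
standardize <- function(z, pi_k, alpha, beta) {
  mu <- sum(pi_k * z)
  s  <- sqrt(sum(pi_k * z^2) - mu^2)
  z     <- (z - mu) / s
  alpha <- alpha + beta * mu                      # absorb the shift into alpha
  beta  <- beta * s                               # absorb the scale into beta
  if (beta[1] < 0) { beta <- -beta; z <- -z }     # fix the direction
  list(z = z, alpha = alpha, beta = beta)
}
```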

2.5 Starting Values

Starting values can heavily influence the ability of the EM algorithm to locate the maximum of the likelihood (see, e.g. [15]). In the R implementation of the EM algorithm of our methodology, the following are the default starting values for parameters \(\pi _{k}, {z}_{k}, \alpha , \beta \), and \(\Sigma _{k}\):

$$\begin{aligned} \pi _{k}^{(0)} = \frac{1}{K}, \end{aligned}$$

where K is the number of components. We use random numbers from a standard normal distribution as the starting values for the mass points,

$$\begin{aligned} z_{k}^{(0)} \sim {N}(0,1), \end{aligned}$$

which are then re-scaled according to (22) and (23). As default starting values for the line parameters, we use

$$\begin{aligned} \alpha ^{(0)} =\frac{1}{n}\sum _{i=1}^n x_i, \qquad \beta ^{(0)} = {x}_{r} - \alpha ^{(0)}, \end{aligned}$$

where \({x}_{r}\in {\mathbb {R}}^m\) is a randomly selected observation. For all four variance parameterizations, we use a diagonal matrix \(\Sigma ^{(0)} \in {\mathbb {R}}^{m\times m}\), not depending on k, as the ‘starting variance matrix’. Let

$$\begin{aligned} {s}_{j} = \sqrt{\frac{1}{n-1} \sum _{i=1}^{n}({x}_{ij} - \bar{x}_{j})^{2}}, \end{aligned}$$

where \({j} = 1, 2,\ldots , {m}\) and \({\bar{x}}_j\) is the sample mean of the j-th variable. Then, for each diagonal element \((\sigma ^{(0)}_{j})^{2}\) of the diagonal matrix \(\Sigma ^{(0)}\), one has the starting value

$$\begin{aligned} \sigma _{j}^{(0)} = \frac{s_j}{K}, \,\,\,j=1, \ldots , m. \end{aligned}$$
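The default starting values described above can be generated along the following lines (an R sketch assuming the data are held in an \(n \times m\) matrix x; names illustrative):

```r
# Default starting values of Sect. 2.5; sketch only.
start_values <- function(x, K) {
  n <- nrow(x); m <- ncol(x)
  pi0    <- rep(1 / K, K)
  z0     <- rnorm(K)                       # raw mass points
  z0     <- z0 - sum(pi0 * z0)             # centre, cf. (22)
  z0     <- z0 / sqrt(sum(pi0 * z0^2))     # scale, cf. (23)
  alpha0 <- colMeans(x)
  beta0  <- x[sample(n, 1), ] - alpha0     # a randomly selected observation
  s      <- apply(x, 2, sd)
  Sigma0 <- diag((s / K)^2, nrow = m)      # diagonal starting variance matrix
  list(pi = pi0, z = z0, alpha = alpha0, beta = beta0, Sigma = Sigma0)
}
```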

3 Simulation

3.1 Estimation of Model Parameters

The EM algorithm derived in the previous section, with all four variance parameterizations, is implemented in R. Some simulations are set up to test the accuracy of the R implementation under different settings.

Under the variance parameterization (i), i.e. the same diagonal matrix for all components, we use two-dimensional data with three individual sample sizes \(n = 100\), \(n = 300\), and \(n = 500\) and generate 1000 data sets from model (7) for each sample size. The true parameter values used for the simulations can be read from the first column of Table 1.
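For reference, data from model (7) under parameterization (i) can be generated along the following lines; this is a sketch with hypothetical parameter values, not the exact configuration of Table 1.

```r
# Generate n observations from model (7), parameterization (i); sketch only.
simulate_model <- function(n, alpha, beta, z, pi_k, Sigma) {
  k   <- sample(length(z), n, replace = TRUE, prob = pi_k)   # latent class labels
  eps <- mvtnorm::rmvnorm(n, sigma = Sigma)                  # Gaussian noise
  x   <- t(alpha + outer(beta, z[k])) + eps                  # alpha + beta z_k + eps
  list(x = x, class = k)
}

# Example call with hypothetical (not the paper's) parameter values:
# dat <- simulate_model(n = 300, alpha = c(0, 0), beta = c(1, 2),
#                       z = c(-1, 1), pi_k = c(0.5, 0.5), Sigma = diag(0.1, 2))
```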

The methodology from Sect. 2.2 is then applied on each generated data set (with random starting values according to Sect. 2.5 to initialize the EM algorithm), and the 1000 estimates for each model parameter (see Table 3) are collected. Comparing the average of the estimated values to the true values of the parameters used to generate these data, some key results are shown in Table 1, Figs. 2, 3, and 4. In Table 1, the averaged estimates of the parameters are close to their true values across all parameters and sample sizes, with the bias in the estimates reducing for larger sample sizes. In Fig. 2, the medians of the estimates \({\hat{\beta }}_{1}\) and \({\hat{\beta }}_{2}\) in the three box plots are similar, but the ranges of the boxes get smaller when increasing n from 100 via 300 to 500. The effect is more clearly visible for the \({\hat{\beta }}_2\)’s than the \({\hat{\beta }}_1\)’s since the larger magnitude of the true value of \(\beta _2\) also comes with larger variability.

Table 1 Simulation results under variance parameterization (i)
Fig. 2

Estimations of parameter \(\beta =(\beta _1, \beta _2)^T\) with different sample sizes under the variance parameterization (i)

Fig. 3

Estimations of parameter \({z}_{k}\) with different sample sizes under the variance parameterization (i)

Fig. 4

Estimations of parameter \(\sigma \) with different sample sizes, where \(\sigma _{1}\) and \(\sigma _{2}\) are the diagonal components of the variance matrix, under the variance parameterization (i)

Similar simulations were conducted to test the accuracy under variance parameterization (ii), again using 1000 replicates of two-dimensional data from model (7) under each of three sample sizes of \(n = 100\), \(n = 300\), and \(n = 500\). We report the numerical results in Table 2 and display the estimated variance structures under this model in Fig. 5. We omit the boxplots for the other parameters as they are similar to those under parameterization (i). In ‘Appendix B’, we provide additional results and boxplots under parameterization (iii).

Table 2 Simulation results under variance parameterization (ii), where \(\sigma _{11}\) and \(\sigma _{21}\) are the diagonal elements of \(\Sigma _{1}\), \(\sigma _{12}\) and \(\sigma _{22}\) are the diagonal elements of \(\Sigma _{2}\)
Fig. 5

Under variance parameterization (ii), estimations of variance parameters with different sample sizes, where \(\sigma _{11}\) and \(\sigma _{21}\) are the diagonal components of the variance matrix for mass point \({k} = {1}\), \(\sigma _{12}\) and \(\sigma _{22}\) are the diagonal components of the variance matrix for mass point \({k} = {2}\)

Overall, the tables and figures show that the estimators give sensible estimates of the parameters: the averaged estimates are close to their true values, there appear to be no systematic biases, and the variability of the estimates reduces with increasing sample size. The boxplots illustrate the consistency of the estimators, with the boxes shrinking towards the true value (red horizontal line) as the sample size gets larger.

Next, we set up another set of simulations to address the importance of using the correct variance parameterization when fitting a model. For each model with each variance parameterization, we generate 200 replicates, each with a sample size of 100, from the model. Then, for the data generated from the model with variance parameterization (i), we fit four different models, each with a different variance parameterization. For the remaining data sets generated from the models with variance parameterizations (ii), (iii), and (iv), we do the same. We use the AIC and BIC [18] as the model selection criteria, with the approximated likelihood (4) taking the role of the likelihood in AIC and BIC. For reference, the number of estimated parameters used in the calculation of AIC and BIC is shown in Table 3, where m is the dimension of the data, and K is the number of mass points.
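Given the value of the approximated log-likelihood (4) at the fitted parameters and the parameter counts of Table 3, the criteria are computed in the usual way; a minimal sketch:

```r
# AIC/BIC from the approximated log-likelihood (4); sketch only.
# loglik: value of (4) at the fitted parameters; npar: parameter count from Table 3.
ic <- function(loglik, npar, n) {
  c(AIC = -2 * loglik + 2 * npar,
    BIC = -2 * loglik + log(n) * npar)
}
```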

Table 3 The number of estimated parameters used for AIC and BIC

Figure 6 shows some key results: For the data sets generated from the model with variance parameterization (i), 73.5% of the fitted models with variance parameterization (i) lead to the smallest AIC values, and 95% of the fitted models with variance parameterization (i) lead to the smallest BIC values. For the data sets generated from the model with variance parameterization (ii), 88% of the fitted models with variance parameterization (ii) lead to the smallest AIC values, and 98% of the fitted models with variance parameterization (ii) lead to the smallest BIC values. For the data sets generated from the model with variance parameterization (iii), 87% of the minimum AIC values and 99% of the minimum BIC values are obtained from a fitted model with variance parameterization (iii). For data sets generated from the model with variance parameterization (iv) and fitting the model with variance parameterization (iv), we obtain 99% of the minimum AIC values and 91.5% of the minimum BIC values. The results indicate that AIC and BIC are, in the large majority of cases, able to identify the correct variance parameterization. Almohaimeed and Einbeck [6] discussed the use of AIC and BIC for model selection under NPML estimation. Although the BIC might lead to a different choice than AIC, Leroux [13] showed that using BIC for selecting the number of mixture components of finite mixture models is consistent. We use AIC and BIC as model selection criteria in our methodology.

Fig. 6

Barplots showing the number of minimum AIC and BIC values obtained from fitted models with different variance parameterizations

4 Additional Inferential Aspects

In the previous sections, the focus was on estimating the parameters of model (7) from multivariate data \(x_i \in {\mathbb {R}}^m\) and demonstrating that these estimators are (in an empirical sense) consistent, and that the variance parameterizations are identifiable. In practice, these steps will rarely be an end in themselves, but rather building blocks on the way to a more concrete statistical question. We now refer back to the four application pillars already mentioned in the introduction and explain these one by one. Additionally, we will address the important question of how bootstrapped standard errors of covariate parameter estimates are obtained, and how these fare in comparison with univariate response models.

4.1 Clustering via MAP Estimation

We have already observed in Sect. 2.2 that the weights \(w_{ik}\) correspond to the posterior probability of observation i belonging to component k. The term ‘posterior’ is here to be understood as the updated probability of class membership, having knowledge of the value of the observation \(x_i\), as opposed to the ‘prior’ probability \(\pi _k\), which does not make use of this information.

Given the availability of \(w_{ik}\) from the last iteration of the EM algorithm, observation \(x_i\) is then classified to the cluster \({\hat{k}}(x_i)\) to which it belongs with highest posterior probability,

$$\begin{aligned} {\hat{k}}(x_i)= \text{ arg } \text{ max}_k w_{ik}. \end{aligned}$$

This cluster allocation rule is commonly known as the maximum a posteriori (MAP) rule. It is noted in this context that, after convergence of the EM algorithm, typically most \(w_{ik}\) are close to 0 or 1 (with obviously only one of them being close to 1), so that this allocation is in most cases very clear-cut. We will see examples for the application of the MAP rule in Sects. 5.1 and 5.3.
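Given the final weight matrix \(W=(w_{ik})\) from the EM algorithm, the MAP allocation is a one-liner in R (sketch; W is assumed to be the \(n \times K\) weight matrix):

```r
# MAP cluster allocation from the final E-step weights; sketch.
map_cluster <- apply(W, 1, which.max)
```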

4.2 Dimension Reduction Through Predicted Latent Scores

One application of our methodology is the compression of m-dimensional data to one-dimensional, model-based scores, which can be considered as the summary information of the original data. This is achieved through the use of the ‘projection’

$$\begin{aligned} {z}_{i}^*= \sum _{k=1}^{K} {w}_{ik}{\hat{z}}_{k}, \end{aligned}$$
(24)

where \({z}_{i}^*\in {\mathbb {R}}\) [1]. Given the fitted model (1), \({z}_{i}^*\) would be the best prediction of the position for the latent variable \({z}_{i}\) that generates the original data \({x}_{i}\). Then, the following equation maps the one-dimensional scores back onto the higher dimensional original data space,

$$\begin{aligned} {x}_{i}^*= \alpha + \beta {z}_{i}^*, \end{aligned}$$

where \({x}_{i}^*\) is the compressed counterpart of the original data. It is clear that, unlike in, e.g. principal component analysis, the projections \(x_i-x_i^{*}\) are not orthogonal to the linear subspace. However, they can still be meaningful: Under the given approach, all differences between observations and their cluster centres are treated as actual errors. The result of this is an increased robustness to such errors, as only clear deviations from a cluster lead to a projection beyond its centre. An example illustrating this behaviour is provided in Fig. 9 in Sect. 5.1.
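In code, the scores (24) and their back-projections are simple matrix operations; a sketch assuming W, z_hat, alpha_hat and beta_hat hold the fitted quantities (names illustrative):

```r
# Scores (24) and compressed data x_i^*; sketch only.
z_star <- as.vector(W %*% z_hat)                              # one score per observation
x_star <- sweep(outer(z_star, beta_hat), 2, alpha_hat, `+`)   # alpha + beta * z_i^*
```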

The one-dimensional scores, \(z_i^*\), can then be used for subsequent inferential procedures, such as a predictor variable in a regression problem involving an external response variable \(y_i\). This approach is illustrated by way of example in Sect. 5.2.

4.3 Ranking

The projected \(z_i^{*}\) provides a ‘summary score’ of all involved variables in the direction spanned by the latent line. Along this line, the positioning of the \(z_i^{*}\) is informative of the degree to which the variables jointly point into the direction of the latent variable. That is, high values of \(z_i^{*}\) indicate overall high values of the contributing variables, and good agreement on what constitutes ‘high’. For instance, if each of three variables constitutes a price index for certain goods, then the higher these constituent indices are, the higher the overall price index will be. Hence, the order statistic of the \(z_i^{*}\), denoted by \(z_{[i]}^{*}\), can be used to rank the cases i, namely by [i], \(i=1, \ldots ,n\). Many of these order statistics will be indistinguishable as the projections will lie on (or close to) the same cluster centre. This makes sense from a clustering point of view: If observations cannot be distinguished statistically (i.e. if they are just distinguished by noise), their ranks cannot be distinguished. De facto, in many cases, the \(z_{[i]}^{*}\) will take as many distinguishable values as there are mass points. This concept will be explained in more detail by means of an example in Sect. 5.3.

4.4 Inclusion of Covariates

Where multivariate response data appear in statistical applications, the most common inferential approach is to define separate regression models for each of the individual variables constituting the multivariate response vector. For instance, while the linear model function lm in the statistical programming language R does allow for a multivariate response, the resulting fitted models correspond exactly to the individual one-dimensional response models. This approach, however, ignores the correlation of the different response variables, which, when taken into account, could lead to reduced parameter standard errors, and hence increased power.

In the original model (1), \({x}_{i}\in {\mathbb {R}}^m\) can be explained by a one-dimensional coordinate system. Under the mixture representation of the model (7), certain latent groups along the one-dimensional line are driving the data generating process. However, these models do not yet allow for the presence of covariates in the data generating process of the \(x_i\). To avoid confounding of the latent variable with such covariates (if they are known), the following is an extended model which includes a vector of p covariates related to the response variables,

$$\begin{aligned} x_i = \alpha + \beta z_i + \Gamma v_i + \varepsilon _{i}, \end{aligned}$$
(25)

where \({x}_{i}\in {\mathbb {R}}^{m}\), \({i} = 1, 2,\ldots , n\), \(\alpha \in {\mathbb {R}}^{m}\), \(\beta \in {\mathbb {R}}^{m}\), \({v}_{i}\in {\mathbb {R}}^{p}\) is the vector of the covariates, and \(\Gamma _{m\times p}\) is a matrix of the coefficients of the covariates. The estimators of these parameters can be found in ‘Appendix A’. When we have only one covariate in model (25), \({v}_{i}\in {\mathbb {R}}\), and we denote \(\Gamma = \gamma \in {\mathbb {R}}^m\).

Notably, under model (25) with \({x}_{i}\in {\mathbb {R}}^{m}\), the ‘models’ for each of the m response variables are linked through the random effect \({z}_{i}\), hence inducing correlation between the response components within a unit, similar to a multilevel model. An example for the use of this modelling technique is provided in Sect. 5.4.

4.5 Bootstrapped Standard Errors

In statistical practice, not only the estimation of \(\Gamma \) but also an assessment of its accuracy (or in other words, a quantification of its uncertainty) is of interest. Since the direct calculation of standard errors is generally difficult in the context of EM estimation, we propose a bootstrap procedure for their computation.

The bootstrap process is carried out with the following steps:

  1. (i)

    We are given a data set \({x}_{i}\in {\mathbb {R}}^{m}\) and a covariate vector \(v_i \in {\mathbb {R}}^p\), \(i=1, \ldots , n\).

  2. (ii)

    Fit model (25) to the data \({x}_{i}\), \({v}_{i}\) to obtain estimates of the parameters.

  3. (iii)

    Sample B data sets from model (25), using the estimated parameters obtained in (ii).

  4. (iv)

    Fit our model to each of these B data sets, obtaining B estimates of \({\hat{\Gamma }}\). Then, calculate the standard deviation across the B replicates for each of the \(m \times p\) components of \({\hat{\Gamma }}\).
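A sketch of this parametric bootstrap loop is given below; fit_model and simulate_model25 are hypothetical placeholders for the fitting and simulation routines for model (25) and are not functions of any published package.

```r
# Parametric bootstrap for the standard errors of the components of Gamma; sketch only.
# fit_model() and simulate_model25() are hypothetical placeholders.
bootstrap_se <- function(x, v, K, B = 300) {
  fit0 <- fit_model(x, v, K)                # step (ii): fit model (25)
  boot_gamma <- replicate(B, {
    xb <- simulate_model25(fit0, v)         # step (iii): simulate from the fitted model
    fit_model(xb, v, K)$Gamma               # step (iv): refit, keep Gamma-hat (m x p matrix)
  })
  apply(boot_gamma, c(1, 2), sd)            # sd of each of the m x p components
}
```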

As an example, we generated a two-dimensional data set \({x}_{i}\in {\mathbb {R}}^{2}\) with \(\pi = (0.3, 0.7)\), \({z} = (1.5, -0.6)\), \(\alpha = (10, 2)\), \(\beta = (1, 3)\), \(\gamma = (0.5, 3)\) and \({B} = 300\). Then, we compared the estimates and standard errors obtained from the use of R function lm (when used as a multivariate response model), with those obtained from the procedures outlined in the previous and current subsection. The results are shown in Table 4; overall, our model leads to considerably smaller standard errors for the estimated coefficient parameter \(\gamma \).

Table 4 Estimations (Standard errors) of \(\gamma \) obtained using different methods
Fig. 7

Density contour plots with different variance parameterizations; top left (i); top right (ii); bottom left (iii); and bottom right (iv)

5 Applications

5.1 Faithful Data: Model Selection and Projection

In Sect. 2.2, we introduced four different variance parameterizations; here, we use the faithful data set again to illustrate the effect of using these different variance specifications on model fitting. Figure 7 shows the density contour plots for fitting the model with the flexible variance parameterizations (i)–(iv). As shown in Table 5, the AIC and BIC values decrease when increasing the complexity of the variance parameterization, even though this does not, of course, need to be the case in general.

The following are the parameter estimates from a fitted model with the selected parameterization (iv), i.e. different full variance–covariance matrices for each component: \({\hat{\pi }} = (0.3559, 0.6441)\), \({\hat{\alpha }} = (3.4878, 70.8971)\), \({\hat{\beta }} = (1.0788, 12.2038)\), \({\hat{z}} = (-1.3454, 0.7433)\), and

$$\begin{aligned} {\hat{\Sigma }}_{1} = \begin{bmatrix} 0.0692 &{} 0.4352 \\ 0.4352 &{} 33.6973 \end{bmatrix}, \,\,\,\,{\hat{\Sigma }}_{2} = \begin{bmatrix} 0.1700 &{} 0.9406 \\ 0.9406 &{} 36.0462 \end{bmatrix}. \end{aligned}$$

Figure 8 shows the clustering resulting from these estimates, according to the cluster allocation process that is described in Sect. 4.1.

Table 5 AIC and BIC values for the faithful data under different variance parameterizations

We can obtain the scores (coordinates of the projected data along the one-dimensional subspace spanned by the latent variable) through the use of Eq. (24). We use the following images to illustrate the process of projecting the original data points onto the estimated low-dimensional space. In Fig. 1, the straight line is the one-dimensional latent space, and the red triangles positioned along the straight line are the estimated mixture centres \({\hat{\alpha }} + {\hat{\beta }}{{\hat{z}}}_{k}\). Figure 8 illustrates how the original data are assigned to different clusters following the MAP rule. The green points on the straight line in Fig. 9 are the compressed data, \( x_i^*\), after projection onto that line. The most distinctive difference between our methodology and principal component analysis is that the projections are not orthogonal, as shown in Fig. 10.

Fig. 8

For the faithful data, graph showing the original data points being assigned to different clusters according to the maximum a posteriori (MAP) rule

Fig. 9

For the faithful data, graph showing the projected data points \({x}_{i}^*\) in green

Fig. 10

For the faithful data, graph showing the projections of the original data points onto the estimated straight latent line

5.2 Soils Data: Dimension Reduction

In this example, we consider using the model-based scores as the explanatory variable in a regression model, with an additional new variable as the response. The data set we use for this analysis is the Soils data set in R package carData [11]. We construct a data frame with \(n=48\) observations and six variables: nitrogen, phosphorus, calcium, magnesium, potassium, and sodium (which are highly correlated, but do not all use the same units), and use an additional variable ‘Density’ (bulk density in gm/cm\(^3\)) as the response.

We apply the methodology laid out in Sect. 2.2 on the six-dimensional space of variables and use AIC and BIC to inform the choice of parameterization and number of mass points. Details of the obtained AIC and BIC values using different numbers of mass points and variance parameterizations are shown in Tables 6 and 7. The AIC and BIC values given in these tables are the minimum values obtained over 20–50 runs with starting values chosen according to Sect. 2.5; the problem of finding the best solution gets harder when increasing K or the complexity of the error structure. We find that AIC and BIC suggest using variance parameterization (ii) with 4 or 3 mass points, respectively, to fit the model.

Table 6 AIC values for the Soils data under different variance parameterizations and different number of mass points
Table 7 BIC values for the Soils data under different variance parameterizations and different number of mass points

Next, we fit a regression model with the scores \({z}_{i}^*\) as the predictor and the variable Density as the response. Principal component regression is a commonly used technique for computing regressions when the explanatory variables are highly correlated. For a fair comparison, we construct the first principal component scores by projecting all data points onto the one-dimensional space and use these scores as the predictor. The fitted lines resulting from the two regression models are shown in Fig. 11. We see that the data are represented quite differently under our methodology. Table 8 shows the statistical measures that evaluate the performance of principal component regression in comparison with our approach (where we have considered both the AIC and BIC solutions). We find that our latent variable approach performs better on the non-scaled data; it is not unduly affected by scales or units and is robust to scaling.
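For reference, the principal component regression benchmark can be constructed as follows; this is a sketch which assumes that the relevant columns of the Soils data frame are named N, P, Ca, Mg, K, Na and Dens (column names should be checked with names(Soils)), and z_star denotes the scores (24) from our fitted model.

```r
# Principal component regression benchmark for the Soils data; sketch only.
# Column names are assumptions -- check with names(Soils).
library(carData)
vars <- c("N", "P", "Ca", "Mg", "K", "Na")
pc1  <- prcomp(Soils[, vars], scale. = FALSE)$x[, 1]   # first PC of the unscaled data
fit_pcr <- lm(Soils$Dens ~ pc1)                        # principal component regression
# fit_lv  <- lm(Soils$Dens ~ z_star)                   # regression on the scores (24)
```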

Fig. 11

Graph showing fitted lines using two regression models. In our methodology, the regression used model-based scores (model with \(k = 4\)) as the explanatory variable and another variable in the Soils data set called ‘Density’ as the response variable. The points in black correspond to Density values at the model-based scores, and points in red are the Density values predicted at the mass points. For the principal component regression, the fitted regression line in blue used the first principal component (the blue points) as the explanatory variable and the variable Density as the response variable

Table 8 Statistical measures of fit for the two regression models

5.3 Literacy Survey Data: Clustering and Ranking

League tables are produced for the comparison of different institutions. Aitkin et al. [3] compared student performance under different teaching techniques using variance component models. Aitkin and Longford [5] investigated several modelling approaches for the comparison of school effectiveness studies. Sofroniou et al. [19] used the International Adult Literacy Survey (IALS) data to construct league tables under the NPML estimation approach. In this section, we reconsider this data set for analysis. The IALS was collected in 13 countries (or country-type entities) on Prose, Document, and Quantitative scales between 1994 and 1995. The data are reported as the percentage of individuals in each country who could not reach a basic level of literacy (the lowest performance category); the data can be found in ‘Appendix C’.

As in [19], we only use the prose scale for the analysis. However, we account for the separation of the reported prose results into male and female attainment differently than in that publication. We consider, for each of the 13 countries, male and female prose attainment as a bivariate response, allowing us to employ model (1) to describe the data, and hence model (7) for parameter estimation. Since the gender variable is now naturally accounted for in the response, no covariates at all are required in the model. Furthermore, since under this modelling approach both female and male prose attainment for a given country are associated with the same random effect, it also eliminates the need to fit a two-level model as in [19], which would otherwise be needed to correlate the female and male observations within each country. So, effectively, by using a gender-defined bivariate response, we are ‘taking one level out’ of the problem.

We fit the model with \(k=3\) mass points and with variance parametrization (ii), which leads to the smallest AIC value of 158.3963 and the smallest BIC value of 166.8705. The scores \({z}_{i}^*\) are obtained as the posterior intercepts and can be considered as summary information of the original data. The task here is to rank the observations using this summary information. With the posterior probability matrix \({W}=(w_{ik})\) obtained at convergence of the EM algorithm, upper-level units (countries) can then be classified into different clusters according to their largest posterior probabilities.

Table 9 shows the joint ranking of the countries, with the countries being classified into different clusters. In the table, the 3 mass points are ordered from left to right, from the cluster in which the country has the smallest percentage of adults being illiterate to the cluster in which the country has the largest percentage of adults being illiterate. The table shows that Sweden is assigned to mass point 1, which has the smallest percentage of people being illiterate. Poland is the only country that is assigned to the high illiteracy mass point 3. The Netherlands and Germany have posterior probabilities that spread across 2 mass points but are assigned to mass points 1 and 2 according to their highest posterior probabilities. We also fit the model (25) with \({k} = 5\) in order to compare with the results obtained by [19]; the results and analysis can be found in ‘Appendix C’.

Table 9 Posterior intercepts and ‘weight matrix’ of posterior probabilities for the IALS data, with implied ranking (‘league table’), for \({k} = 3\)

5.4 Foetal Movement Data: Covariates and Standard Errors

We consider a set of foetal movements data collected before and during the COVID-19 pandemic. The study, which was executed by researchers of the Neonatal Research Lab at Durham University, aims to analyse the effects of COVID on foetal development [16]. The data were recorded via 4D ultrasound scans from a total of 40 mothers (20 before COVID and 20 during COVID) at 32 weeks gestation and consist of the number of movements each foetus carries out in relation to the recordable scan length. The ratio of these counts to scan length then forms the response variables of interest, with the following five specific movements recorded during the 4D ultrasound scans: upper face movements, head movements, mouth movements, touch movements, and eye blink. We are interested in the relationship of these five movements to the variable ‘status’, which indicates the period during which the data were collected (‘pre-COVID’ or ‘during COVID’). For our analysis, this will be considered as a five-variate response, \({x}_{i}\in {\mathbb {R}}^{5}\) whereas status is the predictor, \({v}_{i}\in {\mathbb {R}}\).

We fit model (25) to the data with \({k} = 3\) and variance parametrization (ii), which leads to the smallest AIC (554.3622) and BIC (613.473) values among all parametrizations and numbers of mass points. In principle, one could fit five separate linear regression models, each taking one of the movement scores as the response and status as the predictor. We compare the parameter estimates and standard errors obtained using this ‘naïve’ method with those of our proposed approach, using model (25), where the five equations are linked through a common random effect; the results are shown in Tables 10 and 11. Our methodology, involving a multivariate response model with random effect, gives parameter estimates which are consistent with the ones obtained from separate linear models, while enjoying reduced standard errors of the coefficients. The bottom rows of Tables 10 and 11 also give the p values of the estimated \({\hat{\gamma }}\)’s. We observe that the p values also tend to be reduced, leading to a potentially different decision on the significance of a predictor variable if a decision threshold is crossed.

Table 10 For the COVID data, estimations of \(\gamma \) obtained using individual linear models for upper face movements, head movements, mouth movements, touch movements, and eye blink
Table 11 For the COVID data, estimations of \(\gamma \) obtained using the proposed multivariate response model with random effect

6 Conclusion

Multivariate data are rarely distributed homogeneously in space. In practice, one will often observe that the data reside on a latent linear subspace of smaller dimension than the full data space, or that the data are concentrated in a certain number of clusters. From a statistical modelling point of view, these two concepts are usually dealt with in isolation or in succession, but not simultaneously. That is, often one will account for the lower ‘intrinsic’ dimensionality through methods such as principal component analysis, partial least squares, factor analysis, etc., and then account for clustering in the resulting lower-dimensional space (for instance, by fitting a mixture model to the projections onto that space), or, less commonly, first partition the data into clusters and then apply separate compressions onto linear subspaces within each of them.

In this work, we have proposed a versatile statistical model based on a latent variable representation, which approaches both of these tasks simultaneously and enables solutions to a wide range of inferential problems, including multivariate regression problems in which the original data space might constitute either the predictors or the responses. We have illustrated these scenarios, illuminating different inferential aspects, through a series of examples from various fields, hence demonstrating the power of the proposed approach in statistical practice.

Our work has been based on the premise that the data set under investigation does feature latent structures which are worth identifying or accounting for. The complexity of these latent structures is related to the choice of variance parameterization (i)–(iv). Empirical evidence for the identifiability of these variance matrices has been provided in the simulation section. From a practical point of view, we found variance parameterization (ii), that is, diagonal, cluster-specific variance matrices, to be the most useful; it was in fact also selected by the AIC and BIC criteria in most cases. While the non-diagonal parameterizations (iii) and (iv) may be useful in certain situations, especially when the focus is on accurately describing the cluster structure as in Sect. 5.1, one could, at least conceptually, suspect that situations could arise where the latent variable and the variance matrices ‘compete’ for capturing the direction of the data cloud, hence potentially leading to non-identifiabilities in this respect. While we have not observed such problems in practical data sets, it is the case that convergence of the EM algorithm for scenarios (iii) and (iv) takes longer, and is also more sensitive to the selection of starting values.

As alluded to in the introduction, the basic concept behind the presented approach is not entirely new and has previously been expressed in the neural network community. However, those ideas have not been transferred into the statistical toolbox and embedded into a statistical modelling framework (as done here through the use of random effects) so far. It is also pointed out that several extensions of this work are possible, including the use of nonlinear or multivariate latent spaces with appropriate random effect specifications. Further, one could consider extending this framework towards non-Gaussian response distributions, requiring however more complex, GLM-type estimation methods.

We close with noting that our work can be considered as a particular type of multilevel (i.e. here, two-level) model, with the upper level corresponding to observations \(x_i\) and the lower level to ‘measurements’ \(x_{ij}\) on the ‘repeated responses’. However, as we have seen in the example in Sect. 5.3, the shared random effect on the ‘upper level’ is directly obtained from the inferential framework without resorting to two-level (‘variance component’) modelling in a traditional sense. Spinning this thought further, the present methodology allows for a reduction of the number of levels in a genuine multilevel scenario. For instance, assume one has repeated measures of some quantity taken on the left and right ear of some individuals over time [21]. Then, rather than fitting a three-level model, the two ears could define the axes of a bivariate response model, reducing the problem to a two-level model. Work on such problems is in progress and will be reported elsewhere.