1 Introduction

It is not uncommon for a set of variables to be so strongly correlated that they can be regarded as intrinsically one-dimensional, that is, as generated from a latent one-dimensional linear subspace plus noise. Examples of such situations include price indexes for several goods, educational attainment scores on several different abilities, or several psychological mental health indicators. While a rich set of statistical tools exists for identifying best linear approximations of multivariate data, usually based on algebraic properties of the sample covariance matrix (such as principal component analysis), this paper follows a different approach which is firmly rooted in basic principles of statistical modelling, and hence allows versatile access to routine statistical tasks such as clustering or regression.

The basic idea, which is of more general validity than the framework focused on in this paper, is to consider the approximating lower-dimensional subspace as a latent variable in a multivariate statistical model and to model this latent variable by a random effect. In this work, we develop and implement this very general idea in a more specific framework, where we assume the low-dimensional structure to be a one-dimensional space, i.e. a straight line.

More specifically, we consider a scenario where the multivariate data \({x}_{i}\in {\mathbb {R}}^m\) are noisy observations scattered along the one-dimensional space \(\alpha + \beta {z}\), where \(\alpha ,\beta \in {\mathbb {R}}^m\), and \({z}\in {\mathbb {R}}\) is an unobserved instance of a (latent) variable Z. Then, the observed data \(x_i=(x_{i1}, \ldots , x_{im})^T, i=1, \ldots , n\), are assumed to be generated from the following random effect model,

$$\begin{aligned} x_i = \alpha + \beta z_i + \varepsilon _{i}, \end{aligned}$$
(1)

where \(\varepsilon _{i} \sim {N} (0, \Sigma _i) \) is m-variate Gaussian noise with a positive definite variance matrix \(\Sigma _i\equiv \Sigma (z_i) \in {\mathbb {R}}^{m \times m}\). It is clear that a model with observation-specific \(m \times m\) variance matrices is heavily overparametrized, and we will never contemplate fitting this model in full generality. We still provide this general notation in Eq. (1) as it contains all practically relevant special cases that will arise naturally, including, of course, the homoscedastic case \(\Sigma _i=\Sigma \), \(i=1, \ldots ,n\).

For the estimation of the random effect distribution along this line, we use the nonparametric maximum likelihood approach, which amounts to representing this distribution by a set of discrete mass points (mixture centres) with corresponding masses (mixture probabilities). While this may look like a restrictive assumption, it is actually more flexible than the application of a Gaussian random effect, as it allows for multimodality in the distribution of the latent variable. Indeed, the mixture character of this approach allows for clustering of observations based on the fitted model.

In consequence, we arrive at a modelling approach of enormous versatility. Firstly, as just expressed, observations can be clustered based on maximum a posteriori (MAP) probabilities of class membership [14, chapter 11]. Secondly, by projecting the original data points onto the estimated lower-dimensional space, the dimension of the original multivariate data is reduced (to 1, in the simple framework discussed in this work), and the compressed data can be used as a summary statistic (such as an overall price index across several goods) or for further inferential purposes. Thirdly, the relative order of the posterior random effects (observations ‘projected’ onto the latent linear subspace) can be used for ranking observations in multivariate data sets. Finally, we will show that it is not difficult to include additional covariates into model (1), so that one has de facto a novel tool for multivariate response situations, yielding reduced parameter standard errors as compared to separate univariate response models. We will give each of these important applications some prominence later in the paper.

To enable some intuition for how the model operates (in the simpler case without covariates), we use the faithful data set in R package MASS [20]. This is a two-dimensional data set with 272 observations on two variables: eruption time and the waiting time between two eruptions. The straight line in Fig. 1 is the one-dimensional latent space \(\alpha +\beta z\) that is parameterized by the latent variable. The red triangles positioned along the straight line are the estimated mixture (cluster) centres. As a metaphor, one could consider the mixture centres as ‘washing pegs’ spanning a ‘washing rope’ holding the clusters. We will return to this example in Sect. 5.1 and illustrate there in detail how exactly this image translates into projections (dimension reduction) and clusterings.

Some methodologically related techniques have been suggested previously in the literature, some of them long ago. In the homoscedastic case, the model (1) can be seen as a one-dimensional factor analysis model (see [14, chapter 12]), with the difference that we will apply a discrete mixture approximation of the latent variable \(z_i\). There is also some overlap with the generative topographic mapping (GTM, [7]), which allows for nonlinear manifolds rather than just a latent straight line. However, in the GTM, the latent variables are parameterized by a fixed and equidistant grid, rather than estimable masses and mass points, rendering the approach less useful for clustering-type applications. Under both the factor analysis and the GTM approaches, there is no immediate possibility to include covariates, and hence, they do not serve as a multivariate response model. Models of the type (1) have also been proposed in the literature on model-based clustering in high-dimensional data scenarios, an overview of which is given in [8]. Sammel et al. [17] proposed a latent variable model for mixed discrete and continuous outcomes from the exponential family, where, however, the latent variable itself is modelled by covariates, contrasting with the approach investigated here.

Fig. 1

Graph showing the estimated one-dimensional space with cluster centres in red and \(\alpha \) in green

The remainder of this work is organized as follows. In Sect. 2, we give details of the nonparametric maximum likelihood procedure to estimate the parameters of model (1), yielding an EM algorithm which also automatically estimates masses, mass points, and posterior probabilities of data points being associated with those. Simulation studies which illustrate the accuracy of the proposed estimation methodology are presented in Sect. 3. This is followed by Sect. 4, where we will lay down the clustering and projection operations explicitly. Furthermore, in Sects. 4.4 and 4.5, we consider extension of the proposed framework allowing for covariates along with a bootstrap approach for the computation of standard errors. Applications to several real data scenarios are given in Sect. 5, which we also use to illustrate the main application pillars of clustering, dimension reduction, ranking, and regression, explicitly. The paper is concluded with a discussion in Sect. 6. Some technical derivations are relegated to an ‘Appendix’.

2 Methods and Estimation

2.1 Likelihood

The marginal probability density function \({f}({x}_{i}\vert \alpha , \beta )\) for observations generated from model (1) can be written as

$$\begin{aligned} {f}({x}_{i} \vert \alpha , \beta ) = \int {f}({x}_{i}, {z}_{i} \vert \alpha , \beta )\textrm{d}{z}_{i} = \int {f}({x}_{i} \vert {z}_{i}, \alpha , \beta ) \phi ({z}_{i}) \textrm{d}{z}_{i}, \end{aligned}$$
(2)

where \(f({x}_{i},{z}_{i} \vert \alpha , \beta )\) is the joint probability distribution of observed data \({x}_{i}\) and unobserved random effects \({z}_{i}\), and \(\phi (\cdot )\) is the density function of the random effect distribution Z. This model is not fully specified since it lacks specific parametrizations of the (unknown) \(\Sigma _i=\Sigma (z_i)=\text{ Var }(x_i|z_i, \alpha , \beta )\) and \(\phi \), but let us consider any (additional) parameters involved in these initially as nuisance parameters and construct appropriate parametrizations for them as we go along.

The initial goal is to find maximum likelihood estimates for the parameters \(\alpha \) and \(\beta \) in model (1). Building on the marginal density (2), the likelihood of model (1) is the following,

$$\begin{aligned} L(\alpha , \beta ) = \prod _{i=1}^{n} \int {f}({x}_{i} \vert {z}_{i}, \alpha , \beta ) \phi ({z}_{i}) \textrm{d}{z}_{i} \end{aligned}$$

with corresponding log-likelihood,

$$\begin{aligned} l(\alpha , \beta ) = \sum _{i=1}^{n}\log \left\{ \int {f}({x}_{i} \vert {z}_{i}, \alpha , \beta ) \phi ({z}_{i}) \textrm{d}{z}_{i} \right\} . \end{aligned}$$
(3)

At this stage, a decision needs to be made on how to deal with the integral figuring in Eq. (3). In principle, one could do this based on a Gaussianity assumption on \(\phi (\cdot )\), as is common in the mixed model context, in this case leading us back to a factor analysis framework. However, for reasons expressed in the introduction, we decide differently here and employ instead Aitkin’s nonparametric maximum likelihood approach [2]. Here, the random effect distribution Z is approximated by a discrete mixture distribution, say \({\tilde{Z}}\), which is supported on a finite number of mass points \(z_1, \ldots , z_K\) with masses \(P({\tilde{Z}}=z_k)=\pi _k\), \(k=1, \ldots ,K\). This discrete mixture facilitates a simple approximation of the marginal likelihood which involves sums rather than integrals, i.e.

$$\begin{aligned} l(\alpha , \beta ) \approx \sum _{i=1}^{n}\log \left\{ \sum _{k=1}^K {f}({x}_{i}\vert {z}_{k}, \alpha , \beta ) \pi _k \right\} . \end{aligned}$$
(4)

Laird [12] showed that the marginal likelihood (3) can be approximated arbitrarily well by (4) with a finite set of mass points. We see that this has now become a mixture-type problem, with each mixture component k representing a latent ‘class’ within the domain of Z (we will use the terms class and component interchangeably henceforth). The EM algorithm [9] is one of the most widely used algorithms for the estimation of parameters in mixture models.
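To make the structure of (4) concrete, the following R sketch evaluates the approximated marginal log-likelihood for given parameter values. It is a minimal illustration under a homoscedastic variance specification, using dmvnorm from the mvtnorm package; the function and argument names are our own and not part of any published implementation.

```r
# Approximated marginal log-likelihood (4); illustrative sketch only.
# x:     n x m data matrix
# alpha: m-vector; beta: m-vector
# z:     K-vector of mass points; pi_k: K-vector of masses
# Sigma: m x m variance matrix (homoscedastic case)
library(mvtnorm)

approx_loglik <- function(x, alpha, beta, z, pi_k, Sigma) {
  K <- length(z)
  # f[i, k] = f(x_i | z_k, alpha, beta), a multivariate Gaussian density
  f <- sapply(seq_len(K), function(k)
    dmvnorm(x, mean = alpha + beta * z[k], sigma = Sigma))
  sum(log(f %*% pi_k))   # sum over k inside the log, over i outside
}
```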

Denote by \({f}_{ik} = {P} ( {x}_{i}\vert {\tilde{Z}} = {z_k}) = {f}({x}_{i}\vert {z}_{k}, \alpha , \beta )\) the probability density of \(x_i\) conditional on class k. Then, we know that

$$\begin{aligned} {P}({x}_{i}, {\tilde{Z}}=z_{k}) = {P}({x}_{i}\vert {\tilde{Z}}=z_{k}){P}({\tilde{Z}}=z_{k}) = {f}_{ik} {\pi }_{k}. \end{aligned}$$

Since it is in practice unknown which component each observation belongs to, this is an incomplete data scenario. We describe the missing information on the component membership by an indicator variable

$$\begin{aligned} {G}_{ik}= {\left\{ \begin{array}{ll} 1,&{} \text {if observation } \,i\, \text {belongs to component}\, k\\ 0, &{} \text {otherwise}. \end{array}\right. } \end{aligned}$$

This defines complete data \(({x}_{i}, {G}_{i1},\ldots ,{G}_{iK})\), \({i} = 1,\ldots ,n\), with probability

$$\begin{aligned} {P}({x}_{i}, {G}_{i1},\ldots ,{G}_{iK}) = \prod _{k=1}^{K} ({f}_{ik} {\pi }_{k})^{G_{ik}} \end{aligned}$$

and resulting complete data likelihood \( \prod _{i=1}^{n}\prod _{k=1}^{K} ({f}_{ik} {\pi }_{k})^{G_{ik}}\, \). Then, we can obtain the expected complete log-likelihood

$$\begin{aligned} \begin{aligned} {l}_c&= \sum _{i=1}^{n} {\mathbb {E}}\left[ \log \left( \prod _{k=1}^{K} ( {\pi }_{k} f_{ik})^{{G}_{ik}} \right) |x_i \right] \\&= \sum _{i=1}^{n} \sum _{k=1}^{K} {\mathbb {E}}\left[ {G}_{ik}\vert {x}_{i} \right] \log \left( {\pi }_{k} {f} _{ik}\right) \\&= \sum _{i=1}^{n} \sum _{k=1}^{K} {w}_{ik} \log {\pi }_{k} + \sum _{i=1}^{n} \sum _{k=1}^{K} {w}_{ik} \log {f}_{ik} \end{aligned} \end{aligned}$$
(5)

where \({w}_{ik} = {\mathbb {E}}\left[ {G}_{ik}\vert {x}_{i}\right] = {P}({G}_{ik} = {1}\vert {x}_{i}) = {P}({\tilde{Z}} = z_{k}\vert {x}_{i})\), which is the probability of each observation i belonging to component k. For the component-specific densities \(f_{ik}\), we specify, conditional on the mixture centres \(z_k\), a multivariate Gaussian model

$$\begin{aligned} f_{ik} = \frac{1}{(2\pi )^{m/2}}\frac{1}{|\Sigma _k|^{1/2}} \exp \left( -\frac{1}{2}(x_i - \alpha - \beta z_k)^{T}\Sigma _k^{-1} (x_i - \alpha - \beta z_k) \right) \end{aligned}$$
(6)

where we allow the variance matrices \(\Sigma _k =\Sigma (z_k)\) to depend on the cluster k but not on observation i, hence reducing the complexity of the original, fully heteroscedastic, variance specification considerably. The terms \(\alpha + \beta {z}_{k}\) can be interpreted as the mixture centres in the original data space, spanned along the line \(\alpha + \beta z\). Note that the right hand side of (4) is then the likelihood corresponding to the ‘approximative’ model

$$\begin{aligned} x_i|z_k, \alpha , \beta \sim N(\alpha + \beta z_k, \Sigma _k) \, \text{ with } \text{ probability } \, \pi _k, \end{aligned}$$
(7)

where we treat the mass points \(z_k\), \(k=1, \ldots , K\), and their associated masses \(\pi _k\) as unknown parameters to be estimated in the EM algorithm alongside with the model parameters \(\alpha \) and \(\beta \). This model can be seen as a Gaussian mixture model with structured mean function and component-specific variances, or as a multivariate response version of the ‘nonparametric maximum likelihood’ (NPML) approach for the estimation of mixture masses and mass points in random effect models [2, 4, chapter 8].

Several reduced, parsimonious, parameterizations of the variance matrices \(\Sigma _k\) are possible in order to describe the shape of the clusters around the mixture centres. The simplest case (i) would be a constant and diagonal matrix \(\Sigma _k \equiv \Sigma =\text{ diag }(\sigma _{j}^{2})_{\left\{ 1\le {j}\le {m} \right\} }\in {\mathbb {R}}^{m\times m}\), which gives the same variance specification to all K components of the mixture. Second (ii), we consider using different diagonal variance matrices for different components, \(\Sigma _k=\text{ diag }(\sigma _{jk}^{2})_{\left\{ 1\le {j}\le {m}\right\} } \in {\mathbb {R}}^{m\times m}\), which yields an improvement when the data have clusters of different sizes. Third (iii), we consider using the same full (unrestricted) variance matrix, \(\Sigma _k \equiv \Sigma \in {\mathbb {R}}^{m\times m}\), to capture the correlation of variables. Finally (iv), different full (unrestricted) variance matrices, \(\Sigma _k\in {\mathbb {R}}^{m\times m}\), give better estimates when dealing with clusters that differ in shape and size. In line with (6) and (7), our notation in what follows will be tailored to this most general case (iv), with the results for the reduced parameterizations deriving naturally from it.

2.2 EM Algorithm

Now we can set up the EM algorithm for estimating model (7). It is noted that the developments in this subsection are for a fixed number of components, K. The question of choosing K is considered as a model selection problem and will be addressed through the use of model selection criteria as illustrated in later sections.

E-step

Using the Bayes’ theorem, we obtain the posterior probability of observation i belonging to component k,

$$\begin{aligned} {w}_{ik} = {P}({\tilde{Z}} = z_{k}\vert {x}_{i}) = \frac{{P}({\tilde{Z}} = z_{k}) {P}({x}_{i}\vert {\tilde{Z}} = z_{k})}{\sum _{l}^{} {P}({\tilde{Z}} = z_{l}) {P}({x}_{i}\vert {\tilde{Z}} = z_{l})} = \frac{{\pi }_{k} {f}_{ik}}{\sum _{l}{\pi }_{l} {f}_{il} }. \end{aligned}$$
(8)
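In code, the E-step (8) amounts to evaluating the K component densities and normalizing row-wise. A minimal R sketch, assuming a list Sigma of K variance matrices (parameterization (iv)); all names are illustrative:

```r
# E-step (8): posterior probabilities w_ik; illustrative sketch.
# Sigma is here a list of K variance matrices (parameterization (iv)).
e_step <- function(x, alpha, beta, z, pi_k, Sigma) {
  K <- length(z)
  f <- sapply(seq_len(K), function(k)
    mvtnorm::dmvnorm(x, mean = alpha + beta * z[k], sigma = Sigma[[k]]))
  num <- sweep(f, 2, pi_k, `*`)   # pi_k * f_ik
  num / rowSums(num)              # normalize each row to sum to 1
}
```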

M-step

Using expression (6) for the component-wise densities \(f_{ik}\), the expected complete data log-likelihood becomes

$$\begin{aligned} \begin{aligned} {l_c}&= \sum _{i=1}^{n}\sum _{k=1}^{K} {w}_{ik}\log (\pi _{k}) + \sum _{i=1}^{n}\sum _{k=1}^{K} -\frac{1}{2}{w}_{ik}\log (|\Sigma _k|) + \sum _{i=1}^{n}\sum _{k=1}^{K}-\frac{m}{2}\log (2\pi ){w}_{ik} \\&\quad + \sum _{i=1}^{n}\sum _{k=1}^{K}-\frac{1}{2}{w}_{ik}(x_i - \alpha - \beta z_k)^{T}\Sigma _k^{-1} (x_i - \alpha - \beta z_k). \end{aligned} \end{aligned}$$
(9)

Taking partial derivatives of \(l_{c}\) with respect to each parameter gives the score equations. We then obtain the following estimators for the parameters \(\alpha \), \(\beta \), \({z}_{k}\) and \(\pi _{k}\) by setting these score equations to zero and solving them:

$$\begin{aligned}{} & {} {\hat{\alpha }} = \left( \sum _{i=1}^n\sum _{k=1}^Kw_{ik}{\hat{\Sigma }}_k^{-1}\right) ^{-1}\left( \sum _{i=1}^n\sum _{k=1}^Kw_{ik}{\hat{\Sigma }}_k^{-1}(x_i- {\hat{\beta }} {\hat{z}}_k)\right) \end{aligned}$$
(10)
$$\begin{aligned}{} & {} {\hat{\beta }} = \left( \sum _{i=1}^n\sum _{k=1}^Kw_{ik}{\hat{\Sigma }}_k^{-1}{\hat{z}}_k^2\right) ^{-1} \left( \sum _{i=1}^n\sum _{k=1}^Kw_{ik}{\hat{\Sigma }}_k^{-1}(x_i-{\hat{\alpha }}){\hat{z}}_k\right) \end{aligned}$$
(11)
$$\begin{aligned}{} & {} {\hat{z}}_{k} = \frac{\sum _{i=1}^{n} {w}_{ik} {\hat{\beta }}^{T} {\hat{\Sigma }}_{k}^{-1} ({x}_{i} - {\hat{\alpha }})}{\sum _{i=1}^{n} {w}_{ik} {\hat{\beta }}^{T} {\hat{\Sigma }}_{k}^{-1}{\hat{\beta }}}. \end{aligned}$$
(12)

For the mixture probabilities, since \(\sum _{k=1}^{K} \pi _{k} = 1\), we need to apply a Lagrange multiplier by letting \(\partial \left( {l}_c - \lambda (\sum _{k=1}^{K}\pi _{k} - {1})\right) / \partial \pi _{k} = 0\), yielding

$$\begin{aligned} {\hat{\pi }}_{k} = \frac{1}{n}\sum _{i=1}^{n}{w}_{ik}. \end{aligned}$$
(13)

Estimators for the flexible variance specifications are again obtained by equating the corresponding partial derivatives to zero, giving results as follows:

  1. (i)

    \(\Sigma \) = \(\text{ diag }(\sigma _{j}^{2})_{\left\{ 1\le {j}\le {m} \right\} } \in {\mathbb {R}}^{m\times m}\), \({k} = 1,\ldots ,K,\)

    $$\begin{aligned} {\hat{\sigma }}_{j}^{2} = \frac{1}{n}\sum _{i=1}^{n} \sum _{k=1}^{K} {w}_{ik} ({x}_{ij} - {\hat{\alpha }}_{j} - {\hat{\beta }}_{j}{\hat{z}}_{k})^{2}; \end{aligned}$$
    (14)
  2. (ii)

    \(\Sigma _k\) = \(\text{ diag }(\sigma _{jk}^{2})_{\left\{ 1\le {j}\le {m} \right\} } \in {\mathbb {R}}^{m\times m}\), \({k} = 1,\ldots ,K,\)

    $$\begin{aligned} {\hat{\sigma }}_{jk}^{2} = \frac{\sum _{i=1}^{n} w_{ik}(x_{ij} - {\hat{\alpha }}_{j} - {\hat{\beta }}_{j} {\hat{z}}_{k})^{2}}{\sum _{i=1}^{n} w_{ik}}; \end{aligned}$$
    (15)
  3. (iii)

    \(\Sigma = \Sigma _1 =\cdots = \Sigma _K\in {\mathbb {R}}^{m\times m}\), \({k} = 1,\ldots ,K\),

    $$\begin{aligned} {\hat{\Sigma }} = \frac{1}{n} \sum _{i=1}^{n} \sum _{k=1}^{K} {w}_{ik} (x_{i} - {\hat{\alpha }} - {\hat{\beta }} {\hat{z}}_{k})(x_{i} - {\hat{\alpha }} - {\hat{\beta }} {\hat{z}}_{k})^{T}; \end{aligned}$$
    (16)
  4. (iv)

    \(\Sigma _k\in {\mathbb {R}}^{m\times m}, {k} = 1,\ldots ,K,\)

    $$\begin{aligned} {\hat{\Sigma }}_{k} = \frac{\sum _{i=1}^{n} w_{ik}(x_{i} - {\hat{\alpha }} - {\hat{\beta }} {\hat{z}}_{k})(x_{i} - {\hat{\alpha }} - {\hat{\beta }} {\hat{z}}_{k})^{T} }{\sum _{i=1}^{n} w_{ik}}. \end{aligned}$$
    (17)

It is evident that all of these estimators depend on the weights \(w_{ik}\), hence requiring the use of the EM algorithm which iterates between finding the above estimates and updating the weights given the estimates.
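As an illustration of how these closed-form updates translate into code, the following R sketch implements (13) and the variance update (17) for parameterization (iv); for (ii), one would retain only the diagonal of each matrix. The names are again our own illustration, not the actual implementation.

```r
# M-step updates for the masses (13) and variance matrices (17); sketch only.
# w: n x K matrix of posterior weights from the E-step.
m_step_pi <- function(w) colMeans(w)   # Eq. (13)

m_step_Sigma <- function(x, w, alpha, beta, z) {
  lapply(seq_along(z), function(k) {
    R <- sweep(x, 2, alpha + beta * z[k])    # residuals x_i - alpha - beta z_k
    crossprod(R * w[, k], R) / sum(w[, k])   # weighted scatter matrix, Eq. (17)
  })
}
```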

2.3 Computational Considerations

It is noted from Eqs. (10), (11), and (12) that these involve many inversions of the estimated variance matrices \({\hat{\Sigma }}_k\). This can make the EM algorithm computationally unstable, especially under the component-specific variance parameterizations (ii) and (iv). Therefore, in our implementation of the above EM algorithm, we disentangle the M-step updates of \({\hat{\Sigma }}_k\) from those of \({\hat{\alpha }}\), \({\hat{\beta }}\) and \({\hat{z}}_k\). Specifically, the updates (10), (11), and (12) are executed in a simplified form where \({\hat{\Sigma }}_k \equiv \text{ diag }(\sigma ^2)\), for some constant \(\sigma ^2\) which does not need to be specified since it cancels out from the resulting simplified update equations, yielding

$$\begin{aligned}{} & {} {\hat{\alpha }} = \frac{1}{n}\left( \sum _{i=1}^{n}{x}_{i} - {\hat{\beta }}\sum _{i=1}^{n}\sum _{k=1}^{K}{w}_{ik}\hat{{z}}_{k}\right) ; \end{aligned}$$
(18)
$$\begin{aligned}{} & {} {\hat{\beta }} = \frac{\sum _{i=1}^{n}\sum _{k=1}^{K}{w}_{ik}\hat{{z}}_{k}{x}_{i} - \frac{1}{n}(\sum _{i=1}^{n}{x}_{i})(\sum _{i=1}^{n}\sum _{k=1}^{K}{w}_{ik}\hat{{z}}_{k})}{\sum _{i=1}^{n}\sum _{k=1}^{K}{w}_{ik}\hat{{z}}_{k}^{2} -\frac{1}{n}(\sum _{i=1}^{n}\sum _{k=1}^{K}{w}_{ik}\hat{{z}}_{k})^{2} }; \end{aligned}$$
(19)
$$\begin{aligned}{} & {} {\hat{z}}_{k} = \frac{{\hat{\beta }}^{T}\sum _{i=1}^{n} {w}_{ik} ({x}_{i} - {\hat{\alpha }})}{ {\hat{\beta }}^{T} {\hat{\beta }}\sum _{i=1}^{n} {w}_{ik}}. \end{aligned}$$
(20)

That is, in our implementation, within each M-step, we cycle a small number of times (five is typically sufficient) between (18), (19), and (20), then we update \({\hat{\pi }}_k\) via (13), followed by the respective update of the variance matrices according to any of (14), (15), (16), or (17), depending on the variance parameterization. The resulting updated parameters are then used in the upcoming E-step (8). The simulation studies in Sect. 3 will confirm that this approach yields accurate parameter estimates.
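A sketch of this simplified inner cycle, with the updates (18)–(20) written out in R (variable names illustrative); in the actual algorithm, the updated \({\hat{z}}_k\) would subsequently be re-standardized as described in Sect. 2.4:

```r
# Simplified inner M-step cycle, Eqs. (18)-(20); sketch with illustrative names.
update_line <- function(x, w, z, alpha, beta, n_cycles = 5) {
  n <- nrow(x)
  for (it in seq_len(n_cycles)) {
    zbar  <- as.vector(w %*% z)                      # sum_k w_ik z_k for each i
    alpha <- colMeans(x) - beta * mean(zbar)                         # (18)
    num   <- colSums(x * zbar) - colSums(x) * mean(zbar)
    den   <- sum(as.vector(w %*% z^2)) - n * mean(zbar)^2
    beta  <- num / den                                               # (19)
    z     <- sapply(seq_along(z), function(k)                        # (20)
      sum(w[, k] * (sweep(x, 2, alpha) %*% beta)) / (sum(beta^2) * sum(w[, k])))
  }
  list(alpha = alpha, beta = beta, z = z)
}
```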

2.4 Identifiability

Consider again the model for the \(x_i\) implied by equation (7), i.e.

$$\begin{aligned} x_i = \alpha + \beta z_k + \varepsilon _{i}. \end{aligned}$$
(21)

The product term \(\beta {z}_{k}\) makes the parameters \(\beta =(\beta _1, \ldots , \beta _m)^T\) and \({z}_{k}\) unidentifiable. The vector \(\alpha \) is also unidentifiable as, when moving along the estimated straight line, the same model could be attained by translating all \({z}_{k}\)’s along the line. Therefore, the model is identifiable only under certain restrictions, and in order to fix the problem, we standardize \({z}_{k}\) by letting

$$\begin{aligned} \text{ E }({\tilde{Z}}) = \sum _{k=1}^{K} \pi _{k} {z}_{k} = 0 \end{aligned}$$
(22)

and

$$\begin{aligned} \text{ Var }({\tilde{Z}}) = \sum _{k=1}^{K} \pi _{k} {z}_{k}^{2} - \left( \sum _{k=1}^{K} \pi _{k} {z}_{k}\right) ^{2} = 1. \end{aligned}$$
(23)

Equation (22) solves the problem for \(\alpha \) by fixing the position of \({z}_{k}\)’s along the estimated lower-dimensional subspace, and Eq. (23) solves the scale problem for \(\beta \). Additionally, to identify the direction of the latent variable, we enforce \(\beta _1\ge 0\) (but any other component of \(\beta \) could equally be chosen for this).
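These constraints can be imposed after each update of the mass points without changing the fitted line, by absorbing the shift and scale into \(\alpha \) and \(\beta \); a minimal R sketch (names illustrative):

```r
# Enforce E(Z) = 0, Var(Z) = 1 and beta_1 >= 0 without changing the fitted line; sketch.
standardize <- function(z, pi_k, alpha, beta) {
  mu <- sum(pi_k * z)
  s  <- sqrt(sum(pi_k * z^2) - mu^2)
  z     <- (z - mu) / s
  alpha <- alpha + beta * mu                      # absorb the shift into alpha
  beta  <- beta * s                               # absorb the scale into beta
  if (beta[1] < 0) { beta <- -beta; z <- -z }     # fix the direction
  list(z = z, alpha = alpha, beta = beta)
}
```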

2.5 Starting Values

Starting values can heavily influence the ability of the EM algorithm to locate the maximum of the likelihood (see, e.g. [15]). In the R implementation of the EM algorithm of our methodology, the following are the default starting values for parameters \(\pi _{k}, {z}_{k}, \alpha , \beta \), and \(\Sigma _{k}\):

$$\begin{aligned} \pi _{k}^{(0)} = \frac{1}{K}, \end{aligned}$$

where K is the number of components. We use random numbers from a standard normal distribution as the starting values for the mass points,

$$\begin{aligned} z_{k}^{(0)} \sim {N}(0,1), \end{aligned}$$

which are then re-scaled according to (22) and (23). As default starting values for the line parameters, we use

$$\begin{aligned} \alpha ^{(0)} =\frac{1}{n}\sum _{i=1}^n x_i, \qquad \beta ^{(0)} = {x}_{r} - \alpha ^{(0)}, \end{aligned}$$

where \({x}_{r}\in {\mathbb {R}}^m\) is a randomly selected observation. For all four variance parameterizations, we use a diagonal matrix \(\Sigma ^{(0)} \in {\mathbb {R}}^{m\times m}\), not depending on k, as the ‘starting variance matrix’. Let

$$\begin{aligned} {s}_{j} = \sqrt{\frac{1}{n-1} \sum _{i=1}^{n}({x}_{ij} - \bar{x}_{j})^{2}}, \end{aligned}$$

where \({j} = 1, 2,\ldots , {m}\) and \({\bar{x}}_j\) is the sample mean of the j-th variable. Then, for each diagonal element \((\sigma ^{(0)}_{j})^{2}\) of the diagonal matrix \(\Sigma ^{(0)}\), one has the starting value

$$\begin{aligned} \sigma _{j}^{(0)} = \frac{s_j}{K}, \,\,\,j=1, \ldots , m. \end{aligned}$$
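The default starting values described above can be generated along the following lines (an R sketch assuming the data are held in an \(n \times m\) matrix x; names illustrative):

```r
# Default starting values of Sect. 2.5; sketch only.
start_values <- function(x, K) {
  n <- nrow(x); m <- ncol(x)
  pi0    <- rep(1 / K, K)
  z0     <- rnorm(K)                       # raw mass points
  z0     <- z0 - sum(pi0 * z0)             # centre, cf. (22)
  z0     <- z0 / sqrt(sum(pi0 * z0^2))     # scale, cf. (23)
  alpha0 <- colMeans(x)
  beta0  <- x[sample(n, 1), ] - alpha0     # a randomly selected observation
  s      <- apply(x, 2, sd)
  Sigma0 <- diag((s / K)^2, nrow = m)      # diagonal starting variance matrix
  list(pi = pi0, z = z0, alpha = alpha0, beta = beta0, Sigma = Sigma0)
}
```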

3 Simulation

3.1 Estimation of Model Parameters

The EM algorithm derived in the previous section, with all four variance parameterizations, is implemented in R. Some simulations are set up to test the accuracy of the R implementation under different settings.

Under the variance parameterization (i), i.e. the same diagonal matrix for all components, we use two-dimensional data with three individual sample sizes \(n = 100\), \(n = 300\), and \(n = 500\) and generate 1000 data sets from model (7) for each sample size. The true parameter values used for the simulations can be read from the first column of Table 1.
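For reference, data from model (7) under parameterization (i) can be generated along the following lines; this is a sketch with hypothetical parameter values, not the exact configuration of Table 1.

```r
# Generate n observations from model (7), parameterization (i); sketch only.
simulate_model <- function(n, alpha, beta, z, pi_k, Sigma) {
  k   <- sample(length(z), n, replace = TRUE, prob = pi_k)   # latent class labels
  eps <- mvtnorm::rmvnorm(n, sigma = Sigma)                  # Gaussian noise
  x   <- t(alpha + outer(beta, z[k])) + eps                  # alpha + beta z_k + eps
  list(x = x, class = k)
}

# Example call with hypothetical (not the paper's) parameter values:
# dat <- simulate_model(n = 300, alpha = c(0, 0), beta = c(1, 2),
#                       z = c(-1, 1), pi_k = c(0.5, 0.5), Sigma = diag(0.1, 2))
```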

The methodology from Sect. 2.2 is then applied on each generated data set (with random starting values according to Sect. 2.5 to initialize the EM algorithm), and the 1000 estimates for each model parameter (see Table 3) are collected. Comparing the average of the estimated values to the true values of the parameters used to generate these data, some key results are shown in Table 1, Figs. 2, 3, and 4. In Table 1, the averaged estimates of the parameters are close to their true values across all parameters and sample sizes, with the bias in the estimates reducing for larger sample sizes. In Fig. 2, the medians of the estimates \({\hat{\beta }}_{1}\) and \({\hat{\beta }}_{2}\) in the three box plots are similar, but the ranges of the boxes get smaller when increasing n from 100 via 300 to 500. The effect is more clearly visible for the \({\hat{\beta }}_2\)’s than the \({\hat{\beta }}_1\)’s since the larger magnitude of the true value of \(\beta _2\) also comes with larger variability.

Table 1 Simulation results under variance parameterization (i)
Fig. 2

Estimations of parameter \(\beta =(\beta _1, \beta _2)^T\) with different sample sizes under the variance parameterization (i)

Fig. 3

Estimations of parameter \({z}_{k}\) with different sample sizes under the variance parameterization (i)

Fig. 4

Estimations of parameter \(\sigma \) with different sample sizes, where \(\sigma _{1}\) and \(\sigma _{2}\) are the diagonal components of the variance matrix, under the variance parameterization (i)

Similar simulations were conducted to test the accuracy under variance parameterization (ii), again using 1000 replicates of two-dimensional data from model (7) under each of three sample sizes of \(n = 100\), \(n = 300\), and \(n = 500\). We report the numerical results in Table 2 and display the estimated variance structures under this model in Fig. 5. We omit the boxplots for the other parameters as they are similar to those under parameterization (i). In ‘Appendix B’, we provide additional results and boxplots under parameterization (iii).

Table 2 Simulation results under variance parameterization (ii), where \(\sigma _{11}\) and \(\sigma _{21}\) are the diagonal elements of \(\Sigma _{1}\), \(\sigma _{12}\) and \(\sigma _{22}\) are the diagonal elements of \(\Sigma _{2}\)
Fig. 5

Under variance parameterization (ii), estimations of variance parameters with different sample sizes, where \(\sigma _{11}\) and \(\sigma _{21}\) are the diagonal components of the variance matrix for mass point \({k} = {1}\), \(\sigma _{12}\) and \(\sigma _{22}\) are the diagonal components of the variance matrix for mass point \({k} = {2}\)

Overall, the tables and figures show that the estimators give sensible estimates of the parameters: the averaged estimates are close to their true values, there appear to be no systematic biases, and the variability of the estimates reduces with increasing sample size. The boxplots illustrate the consistency of the estimators, with the boxes shrinking towards the true value (red horizontal line) as the sample size gets larger.

Next, we set up another set of simulations to address the importance of using the correct variance parameterization when fitting a model. For each model with each variance parameterization, we generate 200 replicates, each with a sample size of 100, from the model. Then, for the data generated from the model with variance parameterization (i), we fit four different models, each with a different variance parameterization. For the remaining data sets generated from the models with variance parameterizations (ii), (iii), and (iv), we do the same. We use the AIC and BIC [18] as the model selection criteria, with the approximated likelihood (4) taking the role of the likelihood in AIC and BIC. For reference, the number of estimated parameters used in the calculation of AIC and BIC is shown in Table 3, where m is the dimension of the data, and K is the number of mass points.
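Given the value of the approximated log-likelihood (4) at the fitted parameters and the parameter counts of Table 3, the criteria are computed in the usual way; a minimal sketch:

```r
# AIC/BIC from the approximated log-likelihood (4); sketch only.
# loglik: value of (4) at the fitted parameters; npar: parameter count from Table 3.
ic <- function(loglik, npar, n) {
  c(AIC = -2 * loglik + 2 * npar,
    BIC = -2 * loglik + log(n) * npar)
}
```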

Table 3 The number of estimated parameters used for AIC and BIC

Figure 6 shows some key results: For the data sets generated from the model with variance parameterization (i), 73.5% of the fitted models with variance parameterization (i) lead to the smallest AIC values, and 95% of the fitted models with variance parameterization (i) lead to the smallest BIC values. For the data sets generated from the model with variance parameterization (ii), 88% of the fitted models with variance parameterization (ii) lead to the smallest AIC values, and 98% of the fitted models with variance parameterization (ii) lead to the smallest BIC values. For the data sets generated from the model with variance parameterization (iii), 87% of the minimum AIC values and 99% of the minimum BIC values are obtained from a fitted model with variance parameterization (iii). For data sets generated from the model with variance parameterization (iv) and fitting the model with variance parameterization (iv), we obtain 99% of the minimum AIC values and 91.5% of the minimum BIC values. The results indicate that AIC and BIC are, in the large majority of cases, able to identify the correct variance parameterization. Almohaimeed and Einbeck [6] discussed the use of AIC and BIC for model selection under NPML estimation. Although the BIC might lead to a different choice than AIC, Leroux [13] showed that using BIC for selecting the number of mixture components of finite mixture models is consistent. We use AIC and BIC as model selection criteria in our methodology.

Fig. 6

Barplots showing the number of minimum AIC and BIC values obtained from fitted models with different variance parameterizations

4 Additional Inferential Aspects

In the previous sections, the focus was on estimating the parameters of model (7) from multivariate data \(x_i \in {\mathbb {R}}^m\) and demonstrating that these estimators are (in an empirical sense) consistent, and that the variance parameterizations are identifiable. In practice, these steps will rarely be an end in themselves, but rather building blocks on the way to a more concrete statistical question. We now refer back to the four application pillars already mentioned in the introduction and explain these one by one. Additionally, we will address the important question of how bootstrapped standard errors of covariate parameter estimates are obtained, and how these fare in comparison with univariate response models.

4.1 Clustering via MAP Estimation

We have already observed in Sect. 2.2 that the weights \(w_{ik}\) correspond to the posterior probability of observation i belonging to component k. The term ‘posterior’ is here to be understood as the updated probability of class membership, having knowledge of the value of the observation \(x_i\), as opposed to the ‘prior’ probability \(\pi _k\), which does not make use of this information.

Given the availability of \(w_{ik}\) from the last iteration of the EM algorithm, observation \(x_i\) is then classified to the cluster \({\hat{k}}(x_i)\) to which it belongs with highest posterior probability,

$$\begin{aligned} {\hat{k}}(x_i)= \text{ arg } \text{ max}_k w_{ik}. \end{aligned}$$

This cluster allocation rule is commonly known as the maximum a posteriori (MAP) rule. It is noted in this context that, after convergence of the EM algorithm, typically most \(w_{ik}\) are close to 0 or 1 (with obviously only one of them being close to 1), so that this allocation is in most cases very clear-cut. We will see examples for the application of the MAP rule in Sects. 5.1 and 5.3.
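Given the final weight matrix \(W=(w_{ik})\) from the EM algorithm, the MAP allocation is a one-liner in R (sketch; W is assumed to be the \(n \times K\) weight matrix):

```r
# MAP cluster allocation from the final E-step weights; sketch.
map_cluster <- apply(W, 1, which.max)
```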

4.2 Dimension Reduction Through Predicted Latent Scores

One application of our methodology is the compression of m-dimensional data to one-dimensional, model-based scores, which can be considered as the summary information of the original data. This is achieved through the use of the ‘projection’

$$\begin{aligned} {z}_{i}^*= \sum _{k=1}^{K} {w}_{ik}{\hat{z}}_{k}, \end{aligned}$$
(24)

where \({z}_{i}^*\in {\mathbb {R}}\) [1]. Given the fitted model (1), \({z}_{i}^*\) would be the best prediction of the position for the latent variable \({z}_{i}\) that generates the original data \({x}_{i}\). Then, the following equation maps the one-dimensional scores back onto the higher dimensional original data space,

$$\begin{aligned} {x}_{i}^*= \alpha + \beta {z}_{i}^*, \end{aligned}$$

where \({x}_{i}^*\) is the compressed counterpart of the original data. It is clear that, unlike in, e.g. principal component analysis, the projections \(x_i-x_i^{*}\) are not orthogonal to the linear subspace. However, they can still be meaningful: Under the given approach, all differences between observations and their cluster centres are treated as actual errors. The result of this is an increased robustness to such errors, as only clear deviations from a cluster lead to a projection beyond its centre. An example illustrating this behaviour is provided in Fig. 9 in Sect. 5.1.
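In code, the scores (24) and their back-projections are simple matrix operations; a sketch assuming W, z_hat, alpha_hat and beta_hat hold the fitted quantities (names illustrative):

```r
# Scores (24) and compressed data x_i^*; sketch only.
z_star <- as.vector(W %*% z_hat)                              # one score per observation
x_star <- sweep(outer(z_star, beta_hat), 2, alpha_hat, `+`)   # alpha + beta * z_i^*
```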

The one-dimensional scores, \(z_i^*\), can then be used for subsequent inferential procedures, such as a predictor variable in a regression problem involving an external response variable \(y_i\). This approach is illustrated by way of example in Sect. 5.2.

4.3 Ranking

The projected \(z_i^{*}\) provides a ‘summary score’ of all involved variables in the direction spanned by the latent line. Along this line, the positioning of the \(z_i^{*}\) is informative of the degree to which the variables jointly point into the direction of the latent variable. That is, high values of \(z_i^{*}\) indicate overall high values of the contributing variables, and good agreement on what constitutes ‘high’. For instance, if each of three variables constitutes a price index for certain goods, then the higher these constituent indices are, the higher the overall price index will be. Hence, the order statistic of the \(z_i^{*}\), denoted by \(z_{[i]}^{*}\), can be used to rank the cases i, namely by [i], \(i=1, \ldots ,n\). Many of these order statistics will be indistinguishable as the projections will lie on (or close to) the same cluster centre. This makes sense from a clustering point of view: If observations cannot be distinguished statistically (i.e. if they are just distinguished by noise), their ranks cannot be distinguished. De facto, in many cases, the \(z_{[i]}^{*}\) will take as many distinguishable values as there are mass points. This concept will be explained in more detail by means of an example in Sect. 5.3.

4.4 Inclusion of Covariates

Where multivariate response data appear in statistical applications, the most common inferential approach is to define separate regression models for each of the individual variables constituting the multivariate response vector. For instance, while the linear model function lm in the statistical programming language R does allow for a multivariate response, the resulting fitted models correspond exactly to the individual one-dimensional response models. This approach, however, ignores the correlation of the different response variables, which, when taken into account, could lead to reduced parameter standard errors, and hence increased power.

In the original model (1), \({x}_{i}\in {\mathbb {R}}^m\) can be explained by a one-dimensional coordinate system. Under the mixture representation of the model (7), certain latent groups along the one-dimensional line are driving the data generating process. However, these models do not yet allow for the presence of covariates in the data generating process of the \(x_i\). To avoid confounding of the latent variable with such covariates (if they are known), the following is an extended model which includes a vector of p covariates related to the response variables,

$$\begin{aligned} x_i = \alpha + \beta z_i + \Gamma v_i + \varepsilon _{i}, \end{aligned}$$
(25)

where \({x}_{i}\in {\mathbb {R}}^{m}\), \({i} = 1, 2,\ldots , n\), \(\alpha \in {\mathbb {R}}^{m}\), \(\beta \in {\mathbb {R}}^{m}\), \({v}_{i}\in {\mathbb {R}}^{p}\) is the vector of the covariates, and \(\Gamma _{m\times p}\) is a matrix of the coefficients of the covariates. The estimators of these parameters can be found in ‘Appendix A’. When we have only one covariate in model (25), \({v}_{i}\in {\mathbb {R}}\), and we denote \(\Gamma = \gamma \in {\mathbb {R}}^m\).

Notably, under model (25) with \({x}_{i}\in {\mathbb {R}}^{m}\), the ‘models’ for each of the m response variables are linked through the random effect \({z}_{i}\), hence inducing correlation between the response components within a unit, similar to a multilevel model. An example for the use of this modelling technique is provided in Sect. 5.4.

4.5 Bootstrapped Standard Errors

In statistical practice, not only the estimation of \(\Gamma \) but also an assessment of its accuracy (or in other words, a quantification of its uncertainty) is of interest. Since the direct calculation of standard errors is generally difficult in the context of EM estimation, we propose a bootstrap procedure for their computation.

The bootstrap process is carried out with the following steps:

  1. (i)

    We are given a data set \({x}_{i}\in {\mathbb {R}}^{m}\) and a covariate vector \(v_i \in {\mathbb {R}}^p\), \(i=1, \ldots , n\).

  2. (ii)

    Fit model (25) to the data \({x}_{i}\), \({v}_{i}\) to obtain estimates of the parameters.

  3. (iii)

    Sample B data sets from model (25), using the estimated parameters obtained in (ii).

  4. (iv)

    Fit our model to each of these B data sets, obtaining B estimates of \({\hat{\Gamma }}\). Then, calculate the standard deviation across the B replicates for each of the \(m \times p\) components of \({\hat{\Gamma }}\).
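A sketch of this parametric bootstrap loop is given below; fit_model and simulate_model25 are hypothetical placeholders for the fitting and simulation routines for model (25) and are not functions of any published package.

```r
# Parametric bootstrap for the standard errors of the components of Gamma; sketch only.
# fit_model() and simulate_model25() are hypothetical placeholders.
bootstrap_se <- function(x, v, K, B = 300) {
  fit0 <- fit_model(x, v, K)                # step (ii): fit model (25)
  boot_gamma <- replicate(B, {
    xb <- simulate_model25(fit0, v)         # step (iii): simulate from the fitted model
    fit_model(xb, v, K)$Gamma               # step (iv): refit, keep Gamma-hat (m x p matrix)
  })
  apply(boot_gamma, c(1, 2), sd)            # sd of each of the m x p components
}
```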

As an example, we generated a two-dimensional data set \({x}_{i}\in {\mathbb {R}}^{2}\) with \(\pi = (0.3, 0.7)\), \({z} = (1.5, -0.6)\), \(\alpha = (10, 2)\), \(\beta = (1, 3)\), \(\gamma = (0.5, 3)\) and \({B} = 300\). Then, we compared the estimates and standard errors obtained from the use of R function lm (when used as a multivariate response model), with those obtained from the procedures outlined in the previous and current subsection. The results are shown in Table 4; overall, our model leads to considerably smaller standard errors for the estimated coefficient parameter \(\gamma \).

Table 4 Estimations (Standard errors) of \(\gamma \) obtained using different methods
Fig. 7

Density contour plots with different variance parameterizations; top left (i); top right (ii); bottom left (iii); and bottom right (iv)

5 Applications

5.1 Faithful Data: Model Selection and Projection

In Sect. 2.2, we introduced four different variance parameterizations; here, we use the faithful data set again to illustrate the effect of using these different variance specifications on model fitting. Figure 7 shows the density contour plots for fitting the model with the flexible variance parameterizations (i)–(iv). As shown in Table 5, the AIC and BIC values decrease when increasing the complexity of the variance parameterization, even though this does not, of course, need to be the case in general.

The following are the parameter estimates from a fitted model with the selected parameterization (iv), i.e. different full variance–covariance matrices for each component: \({\hat{\pi }} = (0.3559, 0.6441)\), \({\hat{\alpha }} = (3.4878, 70.8971)\), \({\hat{\beta }} = (1.0788, 12.2038)\), \({\hat{z}} = (-1.3454, 0.7433)\), and

$$\begin{aligned} {\hat{\Sigma }}_{1} = \begin{bmatrix} 0.0692 &{} 0.4352 \\ 0.4352 &{} 33.6973 \end{bmatrix}, \,\,\,\,{\hat{\Sigma }}_{2} = \begin{bmatrix} 0.1700 &{} 0.9406 \\ 0.9406 &{} 36.0462 \end{bmatrix}. \end{aligned}$$

Figure 8 shows the clustering resulting from these estimates, according to the cluster allocation process that is described in Sect. 4.1.

Table 5 AIC and BIC values for the faithful data under different variance parameterizations

We can obtain the scores (coordinates of the projected data along the one-dimensional subspace spanned by the latent variable) through the use of Eq. (24). We use the following images to illustrate the process of projecting the original data points onto the estimated low-dimensional space. In Fig. 1, the straight line is the one-dimensional latent space, and the red triangles positioned along the straight line are the estimated mixture centres \({\hat{\alpha }} + {\hat{\beta }}{{\hat{z}}}_{k}\). Figure 8 illustrates how the original data are assigned to different clusters following the MAP rule. The green points on the straight line in Fig. 9 are the compressed data, \( x_i^*\), after projection onto that line. The most distinctive difference between our methodology and principal component analysis is that the projections are not orthogonal, as shown in Fig. 10.

Fig. 8

For the faithful data, graph showing the original data points being assigned to different clusters according to the maximum a posteriori (MAP) rule

Fig. 9

For the faithful data, graph showing the projected data points \({x}_{i}^*\) in green

Fig. 10

For the faithful data, graph showing the projections of the original data points onto the estimated straight latent line

5.2 Soils Data: Dimension Reduction

In this example, we consider using the model-based scores as the explanatory variable in a regression model, with an additional new variable as the response. The data set we use for this analysis is the Soils data set in R package carData [11]. We construct a data frame with \(n=48\) observations and six variables: nitrogen, phosphorus, calcium, magnesium, potassium, and sodium (which are highly correlated, but do not all use the same units), and use an additional variable ‘Density’ (bulk density in gm/cm\(^3\)) as the response.

We apply the methodology laid out in Sect. 2.2 on the six-dimensional space of variables and use AIC and BIC to inform the choice of parameterization and number of mass points. Details of the obtained AIC and BIC values using different numbers of mass points and variance parameterizations are shown in Tables 6 and 7. The AIC and BIC values given in these tables are the minimum values obtained over 20–50 runs with starting values chosen according to Sect. 2.5; the problem of finding the best solution gets harder when increasing K or the complexity of the error structure. We find that AIC and BIC suggest using variance parameterization (ii) with 4 or 3 mass points, respectively, to fit the model.

Table 6 AIC values for the Soils data under different variance parameterizations and different number of mass points
Table 7 BIC values for the Soils data under different variance parameterizations and different number of mass points

Next, we fit a regression model with the scores \({z}_{i}^*\) as the predictor and the variable Density as the response. Principal component regression is a commonly used technique for computing regressions when the explanatory variables are highly correlated. For a fair comparison, we construct the first principal component scores by projecting all data points onto the one-dimensional space and use these scores as the predictor. The fitted lines resulting from the two regression models are shown in Fig. 11. We see that the data are represented quite differently under our methodology. Table 8 shows the statistical measures that evaluate the performance of principal component regression in comparison with our approach (where we have considered both the AIC and BIC solutions). We find that our latent variable approach performs better on the non-scaled data; it is not unduly affected by scales or units and is robust to scaling.
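For reference, the principal component regression benchmark can be constructed as follows; this is a sketch which assumes that the relevant columns of the Soils data frame are named N, P, Ca, Mg, K, Na and Dens (column names should be checked with names(Soils)), and z_star denotes the scores (24) from our fitted model.

```r
# Principal component regression benchmark for the Soils data; sketch only.
# Column names are assumptions -- check with names(Soils).
library(carData)
vars <- c("N", "P", "Ca", "Mg", "K", "Na")
pc1  <- prcomp(Soils[, vars], scale. = FALSE)$x[, 1]   # first PC of the unscaled data
fit_pcr <- lm(Soils$Dens ~ pc1)                        # principal component regression
# fit_lv  <- lm(Soils$Dens ~ z_star)                   # regression on the scores (24)
```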

Fig. 11

Graph showing fitted lines using two regression models. In our methodology, the regression used model-based scores (model with \(k = 4\)) as the explanatory variable and another variable in the Soils data set called ‘Density’ as the response variable. The points in black correspond to Density values at the model-based scores, and points in red are the Density values predicted at the mass points. For the principal component regression, the fitted regression line in blue used the first principal component (the blue points) as the explanatory variable and the variable Density as the response variable

Table 8 Statistical measures of fit for the two regression models

5.3 Literacy Survey Data: Clustering and Ranking

League tables are produced for the comparison of different institutions. Aitkin et al. [3] compared student performance under different teaching techniques using variance component models. Aitkin and Longford [5] investigated several modelling approaches for the comparison of school effectiveness studies. Sofroniou et al. [19] used the International Adult Literacy Survey (IALS) data to construct league tables under the NPML estimation approach. In this section, we reconsider this data set for analysis. The IALS was collected in 13 countries (or country-type entities) on Prose, Document, and Quantitative scales between 1994 and 1995. The data are reported as the percentage of individuals in each country who could not reach a basic level of literacy (the lowest performance category); the data can be found in ‘Appendix C’.

As in [19], we only use the prose scale for the analysis. However, we account for the separation of the reported prose results into male and female attainment differently than in that publication. We consider, for each of the 13 countries, male and female prose attainment as a bivariate response, allowing us to employ model (1) to describe the data, and hence model (7) for parameter estimation. Since the gender variable is now naturally accounted for in the response, no covariates at all are required in the model. Furthermore, since under this modelling approach both female and male prose attainment for a given country are associated with the same random effect, it also eliminates the need to fit a two-level model as in [19], which would otherwise be needed to correlate the female and male observations within each country. So, effectively, by using a gender-defined bivariate response, we are ‘taking one level out’ of the problem.

We fit the model with \(k=3\) mass points and with variance parametrization (ii), which leads to the smallest AIC value of 158.3963 and the smallest BIC value of 166.8705. The scores \({z}_{i}^*\) are obtained as the posterior intercepts and can be considered as summary information of the original data. The task here is to rank the observations using this summary information. With the posterior probability matrix \({W}=(w_{ik})\) obtained at convergence of the EM algorithm, upper-level units (countries) can then be classified into different clusters according to their largest posterior probabilities.

Table 9 shows the joint ranking of the countries, with the countries being classified into different clusters. In the table, the 3 mass points are ordered from left to right, from the cluster in which the country has the smallest percentage of adults being illiterate to the cluster in which the country has the largest percentage of adults being illiterate. The table shows that Sweden is assigned to mass point 1, which has the smallest percentage of people being illiterate. Poland is the only country that is assigned to the high illiteracy mass point 3. The Netherlands and Germany have posterior probabilities that spread across 2 mass points but are assigned to mass points 1 and 2 according to their highest posterior probabilities. We also fit the model (25) with \({k} = 5\) in order to compare with the results obtained by [19]; the results and analysis can be found in ‘Appendix C’.

Table 9 Posterior intercepts and ‘weight matrix’ of posterior probabilities for the IALS data, with implied ranking (‘league table’), for \({k} = 3\)

5.4 Foetal Movement Data: Covariates and Standard Errors

We consider a set of foetal movements data collected before and during the COVID-19 pandemic. The study, which was executed by researchers of the Neonatal Research Lab at Durham University, aims to analyse the effects of COVID on foetal development [16]. The data were recorded via 4D ultrasound scans from a total of 40 mothers (20 before COVID and 20 during COVID) at 32 weeks gestation and consist of the number of movements each foetus carries out in relation to the recordable scan length. The ratio of these counts to scan length then forms the response variables of interest, with the following five specific movements recorded during the 4D ultrasound scans: upper face movements, head movements, mouth movements, touch movements, and eye blink. We are interested in the relationship of these five movements to the variable ‘status’, which indicates the period during which the data were collected (‘pre-COVID’ or ‘during COVID’). For our analysis, this will be considered as a five-variate response, \({x}_{i}\in {\mathbb {R}}^{5}\) whereas status is the predictor, \({v}_{i}\in {\mathbb {R}}\).

We fit model (25) to the data with \({k} = 3\) and variance parametrization (ii), which leads to the smallest AIC (554.3622) and BIC (613.473) values among all parametrizations and numbers of mass points. In principle, one could fit five separate linear regression models, each taking one of the movement scores as the response and status as the predictor. We compare the parameter estimates and standard errors obtained using this ‘naïve’ method with those of our proposed approach, using model (25), where the five equations are linked through a common random effect; the results are shown in Tables 10 and 11. Our methodology, involving a multivariate response model with random effect, gives parameter estimates which are consistent with the ones obtained from separate linear models, while enjoying reduced standard errors of the coefficients. The bottom rows of Tables 10 and 11 also give the p values of the estimated \({\hat{\gamma }}\)’s. We observe that the p values also tend to be reduced, leading to a potentially different decision on the significance of a predictor variable if a decision threshold is crossed.

Table 10 For the COVID data, estimations of \(\gamma \) obtained using individual linear models for upper face movements, head movements, mouth movements, touch movements, and eye blink
Table 11 For the COVID data, estimations of \(\gamma \) obtained using the proposed multivariate response model with random effect

6 Conclusion

Multivariate data are rarely distributed homogeneously in space. In practice, one will often observe that the data reside on a latent linear subspace of smaller dimension than the full data space, or that the data are concentrated in a certain number of clusters. From a statistical modelling point of view, these two concepts are usually dealt with in isolation or in succession, but not simultaneously. That is, often one will account for the lower ‘intrinsic’ dimensionality through methods such as principal component analysis, partial least squares, factor analysis, etc., and then account for clustering in the resulting lower-dimensional space (for instance, by fitting a mixture model to the projections onto that space), or, less commonly, first partition the data into clusters and then apply separate compressions onto linear subspaces within each of them.

In this work, we have proposed a versatile statistical model based on a latent variable representation, which approaches both of these tasks simultaneously and enables solutions to a wide range of inferential problems, including multivariate regression problems in which the original data space might constitute either the predictors or the responses. We have illustrated these scenarios, illuminating different inferential aspects, through a series of examples from various fields, hence demonstrating the power of the proposed approach in statistical practice.

Our work has been based on the premise that the data set under investigation does feature latent structures which are worth identifying or accounting for. The complexity of these latent structures is related to the choice of variance parameterization (i)–(iv). Empirical evidence for the identifiability of these variance matrices has been provided in the simulation section. From a practical point of view, we found variance parameterization (ii), that is, diagonal, cluster-specific variance matrices, to be the most useful; it was in fact also selected by the AIC and BIC criteria in most cases. While the non-diagonal parameterizations (iii) and (iv) may be useful in certain situations, especially when the focus is on accurately describing the cluster structure as in Sect. 5.1, one could, at least conceptually, suspect that situations could arise where the latent variable and the variance matrices ‘compete’ for capturing the direction of the data cloud, hence potentially leading to non-identifiabilities in this respect. While we have not observed such problems in practical data sets, it is the case that convergence of the EM algorithm for scenarios (iii) and (iv) takes longer, and is also more sensitive to the selection of starting values.

As alluded to in the introduction, the basic concept behind the presented approach is not entirely new and has previously been expressed in the neural network community. However, those ideas have not been transferred into the statistical toolbox and embedded into a statistical modelling framework (as done here through the use of random effects) so far. It is also pointed out that several extensions of this work are possible, including the use of nonlinear or multivariate latent spaces with appropriate random effect specifications. Further, one could consider extending this framework towards non-Gaussian response distributions, requiring however more complex, GLM-type estimation methods.

We close with noting that our work can be considered as a particular type of multilevel (i.e. here, two-level) model, with the upper level corresponding to observations \(x_i\) and the lower level to ‘measurements’ \(x_{ij}\) on the ‘repeated responses’. However, as we have seen in the example in Sect. 5.3, the shared random effect on the ‘upper level’ is directly obtained from the inferential framework without resorting to two-level (‘variance component’) modelling in a traditional sense. Spinning this thought further, the present methodology allows for a reduction of the number of levels in a genuine multilevel scenario. For instance, assume one has repeated measures of some quantity taken on the left and right ear of some individuals over time [21]. Then, rather than fitting a three-level model, the two ears could define the axes of a bivariate response model, reducing the problem to a two-level model. Work on such problems is in progress and will be reported elsewhere.