High-dimensional regression with Gaussian mixtures and partially-latent response variables

Abstract

The problem of approximating high-dimensional data with a low-dimensional representation is addressed. The article makes the following contributions. An inverse regression framework is proposed, which exchanges the roles of input and response such that the low-dimensional variable becomes the regressor, and which is tractable. A mixture of locally-linear probabilistic mappings is introduced, which starts by estimating the parameters of the inverse regression and then infers closed-form solutions for the forward parameters of the high-dimensional regression problem of interest. Moreover, a partially-latent paradigm is introduced, in which the vector-valued response variable is composed of both observed and latent entries, thus making it possible to deal with data contaminated by experimental artifacts that cannot be explained with noise models. The proposed probabilistic formulation can be viewed as a latent-variable augmentation of regression. Expectation-maximization (EM) procedures are introduced, based on a data-augmentation strategy that facilitates the maximum-likelihood search over the model parameters. Two augmentation schemes are proposed and the associated EM inference procedures are described in detail; they may well be viewed as generalizations of a number of EM regression, dimension reduction, and factor analysis algorithms. The proposed framework is validated with both synthetic and real data. Experimental evidence is provided that the method outperforms several existing regression techniques.

Notes

  1.

    https://team.inria.fr/perception/gllim_toolbox/.

  2.

    https://team.inria.fr/perception/research/high-dim-regression/.

  3.

    http://www.mvrvm.com/Multivariate_Relevance_Vector.

  4.

    \(\mathrm{SNR} = 10 \log \left( \Vert {\varvec{y}}\Vert ^2/\Vert {\varvec{e}}\Vert ^2\right) \).

  5.

    http://isomap.stanford.edu/datasets.html.

  6.

    We kept one pixel out of every four, both horizontally and vertically.

  7.

    \(\text{ NRMSE } =\sqrt{\frac{\sum _{m=1}^M (\hat{t}_m - t_m)^2}{\sum _{m=1}^M (t_m - \overline{t})^2}}\) with \(\overline{t} = M^{-1} \sum _{m=1}^M t_m\) (see the sketch following these notes).

  8.

    More details on this evaluation are given in the Supplementary Material, available online at https://team.inria.fr/perception/research/high-dim-regression/.
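
The two quantities defined in footnotes 4 and 7 are straightforward to compute. The following NumPy sketch is illustrative only: the function and array names (`snr_db`, `nrmse`, `y`, `e`, `t_hat`, `t`) are ours, not part of the authors' toolbox, and a base-10 logarithm is assumed for the SNR.

```python
import numpy as np

def snr_db(y, e):
    """Signal-to-noise ratio (footnote 4): 10 log(||y||^2 / ||e||^2),
    assuming the logarithm is base 10 (decibels)."""
    y, e = np.asarray(y, float), np.asarray(e, float)
    return 10.0 * np.log10(np.sum(y ** 2) / np.sum(e ** 2))

def nrmse(t_hat, t):
    """Normalized root mean squared error (footnote 7) between the
    estimates t_hat and the ground-truth values t."""
    t_hat, t = np.asarray(t_hat, float), np.asarray(t, float)
    return np.sqrt(np.sum((t_hat - t) ** 2) / np.sum((t - t.mean()) ** 2))
```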

References

  1. Adragni, K.P., Cook, R.D.: Sufficient dimension reduction and prediction in regression. Philos. Trans. R. Soc. A 367(1906), 4385–4405 (2009)

  2. Agarwal, A., Triggs, B.: Learning to track 3D human motion from silhouettes. In: International conference on machine learning, pp. 9–16. Banff, Canada (2004)

  3. Agarwal, A., Triggs, B.: Recovering 3D human pose from monocular images. IEEE Trans. Pattern Anal. Mach. Intell. 28(1), 44–58 (2006)

  4. Bach, F.R., Jordan, M.I.: A probabilistic interpretation of canonical correlation analysis. Tech. Rep. 688, Department of Statistics, University of California, Berkeley (2005)

  5. Bernard-Michel, C., Douté, S., Fauvel, M., Gardes, L., Girard, S.: Retrieval of Mars surface physical properties from OMEGA hyperspectral images using regularized sliced inverse regression. J. Geophys. Res. 114(E6), (2009)

  6. Bibring, J.P., Soufflot, A., Berthé, M., Langevin, Y., Gondet, B., Drossart, P., Bouyé, M., Combes, M., Puget, P., Semery, A., et al.: Omega: observatoire pour la minéralogie, l’eau, les glaces et l’activité. Mars express: the scientific payload 1240, 37–49 (2004)

  7. Bishop, C.M., Svensén, M., Williams, C.K.I.: GTM: the generative topographic mapping. Neural Comput 10(1), 215–234 (1998)

  8. Bouveyron, C., Celeux, G., Girard, S.: Intrinsic dimension estimation by maximum likelihood in isotropic probabilistic PCA. Pattern Recognit. Lett. 32, 1706–1713 (2011)

  9. Cook, R.D.: Fisher lecture: dimension reduction in regression. Stat. Sci. 22(1), 1–26 (2007)

  10. de Veaux, R.D.: Mixtures of linear regressions. Comput. Stat. Data Anal. 8(3), 227–245 (1989)

  11. Deleforge, A., Horaud, R.: 2D sound-source localization on the binaural manifold. In: IEEE workshop on machine learning for signal processing, Santander, Spain, (2012)

  12. Deleforge, A., Forbes, F., Horaud, R.: Acoustic space learning for sound-source separation and localization on binaural manifolds. Int. J. Neural Syst., (2014)

  13. Douté, S., Deforas, E., Schmidt, F., Oliva, R., Schmitt, B.: A comprehensive numerical package for the modeling of Mars hyperspectral images. In: The 38th Lunar and Planetary Science Conference, (Lunar and Planetary Science XXXVIII), (2007)

  14. Fusi, N., Stegle, O., Lawrence, N.: Joint modelling of confounding factors and prominent genetic regulators provides increased accuracy in genetical genomics studies. PLoS Comput. Biol. 8(1), e1002330 (2012)

  15. Gershenfeld, N.: Nonlinear inference and cluster-weighted modeling. Ann. N. Y. Acad. Sci. 808(1), 18–24 (1997)

  16. Ghahramani, Z., Hinton, G.E.: The EM algorithm for mixtures of factor analyzers. Tech. Rep. CRG-TR-96-1, University of Toronto, (1996)

  17. Ingrassia, S., Minotti, S.C., Vittadini, G.: Local statistical modeling via a cluster-weighted approach with elliptical distributions. J. Classif. 29(3), 363–401 (2012)

  18. Jedidi, K., Ramaswamy, V., DeSarbo, W.S., Wedel, M.: On estimating finite mixtures of multivariate regression and simultaneous equation models. Struct. Equ. Model. 3(3), 266–289 (1996)

  19. Kain, A., Macon, M.W.: Spectral voice conversion for text-to-speech synthesis. IEEE International Conference on Acoustics, Speech, and Signal Processing, Seattle, WA, USA 1, 285–288 (1998)

  20. Kalaitzis, A., Lawrence, N.: Residual component analysis: Generalising PCA for more flexible inference in linear-Gaussian models. In: International Conference on Machine Learning, Edinburgh, Scotland, UK, (2012)

  21. Lawrence, N.: Probabilistic non-linear principal component analysis with gaussian process latent variable models. J. Mach. Learn. Res. 6, 1783–1816 (2005)

  22. Li, K.C.: Sliced inverse regression for dimension reduction. J. Am. Stat. Assoc. 86(414), 316–327 (1991)

  23. McLachlan, G.J., Peel, D.: Robust cluster analysis via mixtures of multivariate t-distributions. In: Lecture Notes in Computer Science, pp 658–666. Springer, Berlin (1998)

  24. McLachlan, G.J., Peel, D., Bean, R.: Modelling high-dimensional data by mixtures of factor analyzers. Comput. Stat. Data Anal. 41(3–4), 379–388 (2003)

  25. Meng, X.L., Rubin, D.B.: Maximum likelihood estimation via the ECM algorithm: a general framework. Biometrika 80(2), 267–278 (1993)

  26. Meng, X.L., Van Dyk, D.: The EM algorithm: an old folk-song sung to a fast new tune. J. R. Stat. Soc. B 59(3), 511–567 (1997)

  27. Naik, P., Tsai, C.L.: Partial least squares estimator for single-index models. J. R. Stat. Soc. B 62(4), 763–771 (2000)

  28. Qiao, Y., Minematsu, N.: Mixture of probabilistic linear regressions: a unified view of GMM-based mapping techniques. In: IEEE international conference on acoustics, speech, and signal processing, pp 3913–3916, (2009)

  29. Quandt, R.E., Ramsey, J.B.: Estimating mixtures of normal distributions and switching regressions. J. Am. Stat. Assoc. 73(364), 730–738 (1978)

  30. Rosipal, R., Krämer, N.: Overview and recent advances in partial least squares. In: Saunders C, Grobelnik M, Gunn S, Shawe-Taylor J (eds) Subspace, latent structure and feature selection, lecture notes in computer science, vol 3940, pp 34–51. Springer, Berlin (2006)

  31. Smola, A.J., Schölkopf, B.: A tutorial on support vector regression. Stat. Comput. 14(3), 199–222 (2004)

  32. Talmon, R., Cohen, I., Gannot, S.: Supervised source localization using diffusion kernels. In: Workshop on Applications of Signal Processing to Audio and Acoustics, pp 245–248, (2011)

  33. Thayananthan, A., Navaratnam, R., Stenger, B., Torr, P., Cipolla, R.: Multivariate relevance vector machines for tracking. In: European conference on computer vision, pp 124–138. Springer, Heidelberg (2006)

  34. Tipping, M.: Sparse Bayesian learning and the relevance vector machine. J. Mach. Learn. Res. 1, 211–244 (2001)

  35. Tipping, M.E., Bishop, C.M.: Mixtures of probabilistic principal component analyzers. Neural Comput. 11(2), 443–482 (1999a)

  36. Tipping, M.E., Bishop, C.M.: Probabilistic principal component analysis. J. R. Stat. Soc. B 61(3), 611–622 (1999b)

  37. Vapnik, V., Golowich, S., Smola, A.: Support vector method for function approximation, regression estimation, and signal processing. In: Mozer, M., Jordan, M.I., Petsche, T. (eds.) Advances in neural information processing, pp. 281–287. MIT Press, Cambridge (1997)

  38. Wang, C., Neal, R.M.: Gaussian process regression with heteroscedastic or non-gaussian residuals. Computing Research Repository. (2012)

  39. Wedel, M., Kamakura, W.A.: Factor analysis with (mixed) observed and latent variables in the exponential family. Psychometrika 66(4), 515–530 (2001)

  40. Wu, H.: Kernel sliced inverse regression with applications to classification. J. Comput. Graph. Stat. 17(3), 590–610 (2008)

  41. Xu, L., Jordan, M.I., Hinton, G.E.: An alternative model for mixtures of experts. In: Tesauro, G., Touretzky, D.S., Leen, T.K. (eds.) Advances in neural information processing systems, pp. 633–640. MIT Press, Cambridge (1995)

  42. Zhao, J.H., Yu, P.L.: Fast ML estimation for the mixture of factor analyzers via an ECM algorithm. IEEE Trans. Neural Netw. 19(11), 1956–1961 (2008)

Acknowledgments

The authors wish to thank the anonymous reviewers for their constructive remarks and suggestions, which helped to organize and improve this article.

Author information

Corresponding author

Correspondence to Antoine Deleforge.

Appendices

Appendix 1: Link between joint GMM and GLLiM

Proposition 1

A GLLiM model on \({\varvec{X}},{\varvec{Y}}\) with unconstrained parameters \(\varvec{\theta }= \{{\varvec{c}}_k,\varvec{\Gamma }_k,\pi _k,\mathbf{A}_k,\varvec{b}_k,\varvec{\Sigma }_k\}_{k=1}^K\) is equivalent to a Gaussian mixture model on the joint variable \([{\varvec{X}};{\varvec{Y}}]\) with unconstrained parameters \(\varvec{\psi }=\{{\varvec{m}}_k,\mathbf{V}_k,\rho _k\}_{k=1}^K\), i.e.,

$$\begin{aligned} p({\varvec{X}}={\varvec{x}},{\varvec{Y}}={\varvec{y}};\varvec{\theta })=\sum _{k=1}^K \rho _k \mathcal {N}([{\varvec{x}};{\varvec{y}}];{\varvec{m}}_k,\mathbf{V}_k). \end{aligned}$$
(42)

The parameter vector \(\varvec{\theta }\) can be expressed as a function of \(\varvec{\psi }\) by:

$$\begin{aligned}&\pi _k =\rho _k, \nonumber \\&{\varvec{c}}_k ={\varvec{m}}^{\mathrm{x}}_{k}, \nonumber \\&\varvec{\Gamma }_k =\mathbf{V}_k^{\mathrm{xx}}, \nonumber \\&\mathbf{A}_k = \mathbf{V}_k^{\mathrm{xy}\top }(\mathbf{V}_k^{\mathrm{xx}})^{-1}, \nonumber \\&\varvec{b}_k = {\varvec{m}}^{\mathrm{y}}_{k}-(\mathbf{V}_k^{\mathrm{xy}})^{\top }(\mathbf{V}_k^{\mathrm{xx}})^{-1}{\varvec{m}}^{\mathrm{x}}_{k},\nonumber \\&\varvec{\Sigma }_k = \mathbf{V}_k^{\mathrm{yy}}-(\mathbf{V}_k^{\mathrm{xy}})^{\top }(\mathbf{V}_k^{\mathrm{xx}})^{-1}\mathbf{V}_k^{\mathrm{xy}},\nonumber \\&\mathrm{where} \quad {\varvec{m}}_k =\left[ \begin{array}{c} {\varvec{m}}^{\mathrm{x}}_{k}\\ {\varvec{m}}^{\mathrm{y}}_{k} \end{array} \right] \nonumber \\&\mathrm{and} \quad \mathbf{V}_k=\left[ \begin{array}{lc} \mathbf{V}_k^{\mathrm{xx}} &{} \mathbf{V}_k^{\mathrm{xy}}\\ \mathbf{V}_k^{\mathrm{xy}\top } &{} \mathbf{V}_k^{\mathrm{yy}} \end{array} \right] . \end{aligned}$$
(43)

The parameter \(\varvec{\psi }\) can be expressed as a function of \(\varvec{\theta }\) by:

$$\begin{aligned} \begin{array}{rl} \rho _k=&{}\pi _k, \\ {\varvec{m}}_k=&{}\left[ \begin{array}{c} {\varvec{c}}_k \\ \mathbf{A}_k{\varvec{c}}_k+\varvec{b}_k \end{array} \right] , \\ \mathbf{V}_k=&{}\left[ \begin{array}{cc} \varvec{\Gamma }_k &{} \varvec{\Gamma }_k\mathbf{A}_k^{\top }\\ \mathbf{A}_k\varvec{\Gamma }_k &{} \varvec{\Sigma }_k+\mathbf{A}_k\varvec{\Gamma }_k\mathbf{A}_k^{\top }\end{array} \right] . \end{array} \end{aligned}$$
(44)

Note that this proposition was proved for \(D=1\) in (Ingrassia et al. 2012), but not in the general case considered here.

Proof

(43) is obtained using (44) and the formulas for conditional multivariate Gaussian variables. (44) is obtained from standard algebra by identifying the joint distribution \(p({\varvec{X}},{\varvec{Y}}|{\varvec{Z}};\varvec{\theta })\) defined by (3) and (4) with a multivariate Gaussian distribution. To complete the proof, one needs to prove the following two statements:

  1. (i)

    For any \(\rho _k\in \mathbb {R}, {\varvec{m}}_k\in \mathbb {R}^{D+L}\) and \(\mathbf{V}_k\in \mathcal {S}^{L+D}_+\), there is a set of parameters \({\varvec{c}}_k\in \mathbb {R}^L,\varvec{\Gamma }_k\in \mathcal {S}^{L}_+,\pi _k\in \mathbb {R},\mathbf{A}_k\in \mathbb {R}^{D\times L},\varvec{b}_k\in \mathbb {R}^D\) and \(\varvec{\Sigma }_k\in \mathcal {S}^{D}_+\) such that (43) holds.

  2. (ii)

    Reciprocally, for any \({\varvec{c}}_k\in \mathbb {R}^L,\varvec{\Gamma }_k\in \mathcal {S}^{L}_+,\pi _k\in \mathbb {R},\mathbf{A}_k\in \mathbb {R}^{D\times L},\varvec{b}_k\in \mathbb {R}^D,\varvec{\Sigma }_k\in \mathcal {S}^{D}_+\) there is a set of parameters \(\rho _k\in \mathbb {R}, {\varvec{m}}_k\in \mathbb {R}^{L+D}\) and \(\mathbf{V}_k\in \mathcal {S}^{D+L}_+\) such that (44) holds,

where \(\mathcal {S}^{M}_+\) denotes the set of \(M\times M\) symmetric positive definite matrices. We introduce the following lemma. \(\square \)

Lemma 1

If

$$\begin{aligned} \mathbf{V}=\left[ \begin{array}{lc} \mathbf{V}^{\mathrm{xx}} &{} \mathbf{V}^{\mathrm{xy}}\\ \mathbf{V}^{\mathrm{xy}\top } &{} \mathbf{V}^{\mathrm{yy}} \end{array} \right] \in \mathcal {S}^{L+D}_+, \end{aligned}$$

then \(\varvec{\Sigma }=\mathbf{V}^{\mathrm{yy}} - \mathbf{V}^{\mathrm{xy}\top }(\mathbf{V}^{\mathrm{xx}})^{-1}\mathbf{V}^{\mathrm{xy}}\in \mathcal {S}^{D}_+\).

Proof

Since \(\mathbf{V}\in \mathcal {S}^{L+D}_+\) we have \({\varvec{u}}^{\top }\mathbf{V}{\varvec{u}}>0\) for all non null \({\varvec{u}}\in \mathbb {R}^{L+D*}\). Using the decomposition \({\varvec{u}}=[{\varvec{u}}^{\mathrm{x}};{\varvec{u}}^{\mathrm{y}}]\) we obtain

$$\begin{aligned}&{\varvec{u}}^{\mathrm{x}\top }\mathbf{V}^{\mathrm{xx}}{\varvec{u}}^{\mathrm{x}}+ 2{\varvec{u}}^{\mathrm{x}\top }\mathbf{V}^{\mathrm{xy}}{\varvec{u}}^{\mathrm{y}}+ {\varvec{u}}^{\mathrm{y}\top }\mathbf{V}^{\mathrm{yy}}{\varvec{u}}^{\mathrm{y}}>0\\&\quad \forall {\varvec{u}}^{\mathrm{x}}\in \mathbb {R}^{L*}, \quad \forall {\varvec{u}}^{\mathrm{y}}\in \mathbb {R}^{D*}. \end{aligned}$$

In particular, for \({\varvec{u}}^{\mathrm{x}}=-\mathbf{V}^{\mathrm{xx}-1}\mathbf{V}^{\mathrm{xy}}{\varvec{u}}^{\mathrm{y}}\) we obtain

$$\begin{aligned}&{\varvec{u}}^{\mathrm{y}\top }(\mathbf{V}^{\mathrm{yy}}-\mathbf{V}^{\mathrm{xy}\top }\mathbf{V}^{\mathrm{xx}-1}\mathbf{V}^{\mathrm{xy}}){\varvec{u}}^{\mathrm{y}}>0 \,\Leftrightarrow \, {\varvec{u}}^{\mathrm{y}\top }\varvec{\Sigma }{\varvec{u}}^{\mathrm{y}}>0\\&\quad \forall {\varvec{u}}^{\mathrm{y}}\in \mathbb {R}^{D*} \end{aligned}$$

and hence \(\varvec{\Sigma }\in \mathcal {S}^{D}_+\). \(\square \)

Lemma 2

If \(\mathbf{A}\in \mathbb {R}^{D\times L}, \varvec{\Gamma }\in \mathcal {S}^{L}_+, \varvec{\Sigma }\in \mathcal {S}^{D}_+\), then

$$\begin{aligned} \mathbf{V}=\left[ \begin{array}{cc} \varvec{\Gamma }&{} \varvec{\Gamma }\mathbf{A}^{\top }\\ \mathbf{A}\varvec{\Gamma }&{} \varvec{\Sigma }+\mathbf{A}\varvec{\Gamma }\mathbf{A}^{\top }\end{array} \right] \in \mathcal {S}^{L+D}_+. \end{aligned}$$

Proof

Since \(\varvec{\Gamma }\in \mathcal {S}^{L}_+\) there is a unique symmetric positive definite matrix \(\varvec{\Lambda }\in \mathcal {S}^{L}_+\) such that \(\varvec{\Gamma }=\varvec{\Lambda }^2\). Using standard algebra, we obtain that for all non null \({\varvec{u}}=[{\varvec{u}}^{\mathrm{x}};{\varvec{u}}^{\mathrm{y}}]\in \mathbb {R}^{L+D*}\),

$$\begin{aligned} {\varvec{u}}^{\top }\mathbf{V}{\varvec{u}}=||\varvec{\Lambda }{\varvec{u}}^{\mathrm{x}}+\varvec{\Lambda }\mathbf{A}^{\top }{\varvec{u}}^{\mathrm{y}}||^2+{\varvec{u}}^{\mathrm{y}\top }\varvec{\Sigma }{\varvec{u}}^{\mathrm{y}} \end{aligned}$$

where \(||.||\) denotes the standard Euclidean distance. The first term of the sum is positive for all \([{\varvec{u}}^{\mathrm{x}};{\varvec{u}}^{\mathrm{y}}]\in \mathbb {R}^{L+D*}\) and the second term strictly positive for all \({\varvec{u}}^{\mathrm{y}}\in \mathbb {R}^{D*}\) because \(\varvec{\Sigma }\in \mathcal {S}^{D}_+\) by hypothesis. Therefore, \(\mathbf{V}\in \mathcal {S}^{L+D}_+\).

Lemma 1 and the correspondence formulae (43) prove (i), Lemma 2 and the correspondence formulae (44) prove (ii), hence completing the proof.\(\square \)
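
To make the correspondence between the two parameterizations concrete, the following NumPy sketch implements Eqs. (43) and (44) for a single mixture component; the function names and the dictionary layout are assumptions made for this illustration and do not correspond to the authors' GLLiM toolbox API.

```python
import numpy as np

def theta_from_psi(rho, m, V, L):
    """Joint-GMM parameters (rho_k, m_k, V_k) -> GLLiM parameters, Eq. (43).
    L is the dimension of the low-dimensional variable X."""
    m_x, m_y = m[:L], m[L:]
    V_xx, V_xy, V_yy = V[:L, :L], V[:L, L:], V[L:, L:]
    A = np.linalg.solve(V_xx, V_xy).T                 # V_xy^T V_xx^{-1}
    b = m_y - A @ m_x
    Sigma = V_yy - V_xy.T @ np.linalg.solve(V_xx, V_xy)
    return dict(pi=rho, c=m_x, Gamma=V_xx, A=A, b=b, Sigma=Sigma)

def psi_from_theta(pi, c, Gamma, A, b, Sigma):
    """GLLiM parameters -> joint-GMM parameters, Eq. (44)."""
    m = np.concatenate([c, A @ c + b])
    V = np.block([[Gamma,      Gamma @ A.T],
                  [A @ Gamma,  Sigma + A @ Gamma @ A.T]])
    return dict(rho=pi, m=m, V=V)
```

Round-tripping, e.g. `psi_from_theta(**theta_from_psi(rho, m, V, L))`, recovers the joint-GMM parameters up to numerical error, which provides a convenient numerical check of Proposition 1.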

Appendix 2: The marginal hybrid GLLiM-EM

By marginalizing out the hidden variables \({\varvec{W}}_{1:N}\), we obtain a different EM algorithm than the one presented in Sect. 5, with hidden variables \({\varvec{Z}}_{1:N}\) only. For a clearer connection with standard procedures we assume here, as already specified, that \({\varvec{c}}_k^{\mathrm{w}} = \varvec{0}_{L_{\mathrm{w}}}\) and \(\varvec{\Gamma }_k^{\mathrm{w}}= \mathbf{I}_{L_{\mathrm{w}}}\). The E–W-step disappears while the E–Z-step and the following updating of \(\pi _k\), \({\varvec{c}}_k^{\mathrm{t}}\) and \(\varvec{\Gamma }_k^{\mathrm{t}}\) in the M-GMM-step are exactly the same as in Sect. 5.4. However, the marginalization of \({\varvec{W}}_{1:N}\) leads to a clearer separation between the regression parameters \(\mathbf{A}_k^{\mathrm{t}}\) and \(\varvec{b}_k\) (M-regression-step) and the other parameters \(\mathbf{A}_k^{\mathrm{w}}\) and \(\varvec{\Sigma }_k\) (M-residual-step). This can be seen straightforwardly from Eq. (19) which shows that after marginalizing \({\varvec{W}}\), the model parameters separate into a standard regression part \({\varvec{A}}_k^{\mathrm{t}} {\varvec{t}}_n + \varvec{b}_k\) for which standard estimators do not involve the noise variance and a PPCA-like part on the regression residuals \({\varvec{y}}_n- \tilde{{\varvec{A}}}_k^{\mathrm{t}} {\varvec{t}}_n - \tilde{\varvec{b}}_k\), in which the non standard noise covariance \(\varvec{\Sigma }_k + \mathbf{A}_k^{\mathrm{w}} (\mathbf{A}_k^{\mathrm{w}})^{\top }\) is typically dealt with by adding a latent variable \({\varvec{W}}\).

The algorithm therefore consists of the E–Z-step and M-GMM-step detailed in Sect. 5.4, together with the following M-steps:

M-regression-step:

The \(\mathbf{A}_k^{\mathrm{t}}\) and \(\varvec{b}_k\) parameters are obtained using standard weighted affine regression from \(\{{\varvec{t}}_n\}_{n=1}^N\) to \(\{{\varvec{y}}_n\}_{n=1}^N\) with weights \(\widetilde{r}_{nk}\), i.e.,

$$\begin{aligned} \widetilde{\mathbf{A}}^{\mathrm{t}}_k=\widetilde{\mathbf{Y}}_k\widetilde{\mathbf{T}}_k^{\top }(\widetilde{\mathbf{T}}_k\widetilde{\mathbf{T}}_k^{\top })^{-1}, \widetilde{\varvec{b}}_k = \sum _{n=1}^N\frac{\widetilde{r}_{kn}}{\widetilde{r}_{k}}({\varvec{y}}_n-\widetilde{\mathbf{A}}_k^t{\varvec{t}}_{n}), \end{aligned}$$
(45)

with

$$\begin{aligned} \widetilde{\mathbf{T}}_k = \left[ \sqrt{\widetilde{r}_{1k}}({\varvec{t}}_1-\widetilde{{\varvec{t}}}_k) \cdots \sqrt{\widetilde{r}_{Nk}}({\varvec{t}}_N-\widetilde{{\varvec{t}}}_k)\right] /\sqrt{\widetilde{r}_k} \end{aligned}$$

and with

$$\begin{aligned} \widetilde{{\varvec{t}}}_k = \sum _{n=1}^N(\widetilde{r}_{kn}/\widetilde{r}_k){\varvec{t}}_n\;. \end{aligned}$$
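
In code, the M-regression-step reduces to a weighted least-squares fit. The sketch below is a minimal NumPy illustration of Eq. (45) for one component; it assumes column-wise data storage (\(\mathbf{Y}\) of size \(D\times N\), \(\mathbf{T}\) of size \(L_{\mathrm{t}}\times N\)) and that \(\widetilde{\mathbf{Y}}_k\) is built from the \({\varvec{y}}_n\) exactly as \(\widetilde{\mathbf{T}}_k\) is built from the \({\varvec{t}}_n\); the function and variable names are ours.

```python
import numpy as np

def m_regression_step(Y, T, r_k):
    """Weighted affine regression from {t_n} to {y_n} with weights r_nk, Eq. (45).
    Y: (D, N) observations, T: (Lt, N) observed responses, r_k: (N,) responsibilities."""
    w = r_k / r_k.sum()
    t_bar = T @ w                                   # weighted mean of t_n
    y_bar = Y @ w                                   # weighted mean of y_n
    sw = np.sqrt(w)
    T_tilde = (T - t_bar[:, None]) * sw             # weighted, centred T
    Y_tilde = (Y - y_bar[:, None]) * sw             # weighted, centred Y
    A_t = Y_tilde @ T_tilde.T @ np.linalg.inv(T_tilde @ T_tilde.T)
    b = y_bar - A_t @ t_bar                         # equals the weighted mean of y_n - A_t t_n
    return A_t, b
```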

M-residual-step:

Optimal values for \(\mathbf{A}^{\mathrm{w}}_k\) and \(\varvec{\Sigma }_k\) are obtained by minimization of the following criterion:

$$\begin{aligned} Q_k(\varvec{\Sigma }_k,\mathbf{A}_k^{\mathrm{w}})&= - \frac{1}{2}\biggl (\log |\varvec{\Sigma }_k+\mathbf{A}_k^{\mathrm{w}}\mathbf{A}_k^{\mathrm{w}\top }|\nonumber \\&\quad +\sum _{n=1}^N{\varvec{u}}_{kn}^{\top }(\varvec{\Sigma }_k+\mathbf{A}_k^{\mathrm{w}}\mathbf{A}_k^{\mathrm{w}\top })^{-1}{\varvec{u}}_{kn}\biggr ), \end{aligned}$$
(46)

where \({\varvec{u}}_{kn}=\sqrt{\widetilde{r}_{nk}/\widetilde{r}_k}({\varvec{y}}_n-\widetilde{\mathbf{A}}^{\mathrm{t}}_k{\varvec{t}}_n-\widetilde{\varvec{b}}_k)\). The vectors \(\{{\varvec{u}}_{kn}\}_{n=1}^N\) can be seen as the residuals of the \(k\)-th local affine transformation. No closed-form solution exists in the general case. A first option is to use an inner loop such as a gradient descent technique, or to consider \(Q_k\) as the new target observed-data likelihood and use an inner EM, corresponding to the general EM described in the previous section with \(L_{\mathrm{t}}=0\) and \(K=1\). Another option is to use the expectation conditional maximization (ECM) algorithm (Meng and Rubin 1993), as proposed in (Zhao and Yu 2008). The ECM algorithm replaces the M-step of the EM algorithm with a sequence of conditional maximization (CM) steps. In the general case, these CM steps lead to an update of \(\mathbf{A}_k^{\mathrm{w}}\) conditional on \(\varvec{\Sigma }_k\), which is similar to PPCA (Tipping and Bishop 1999b) with an isotropic noise variance fixed to 1. This yields very convenient closed-form expressions (Zhao and Yu 2008), as detailed below. It is shown in (Zhao and Yu 2008) that such an ECM algorithm is computationally more efficient than EM when the sample size is large relative to the data dimension, and that the reverse may be true in other situations.

However, in the particular case \(\varvec{\Sigma }_k=\sigma ^2_k\mathbf{I}_D\), we can afford a standard EM as it connects to PPCA. Indeed, one may notice that \(Q_k\) has then exactly the same form as the observed-data log-likelihood in PPCA, with parameters \((\sigma ^2_k,\mathbf{A}_k^{\mathrm{w}})\) and observations \(\{{\varvec{u}}_{kn}\}_{n=1}^N\). Denoting with \(\mathbf{C}_k= \sum _{n=1}^N {\varvec{u}}_{kn} {\varvec{u}}_{kn}^T/N\) the \(D\times D\) sample residual covariance matrix and with \(\lambda _{1k}>\dots >\lambda _{Dk}\) its eigenvalues in decreasing order, we can therefore use the key result of (Tipping and Bishop 1999b) to see that a global maximum of \(Q_k\) is obtained for

$$\begin{aligned} \widetilde{\mathbf{A}}_k^{\mathrm{w}}&= \mathbf{U}_k(\varvec{\Lambda }_k-\sigma ^2_k\mathbf{I}_{L_{\mathrm{w}}})^{1/2},\end{aligned}$$
(47)
$$\begin{aligned} \widetilde{\sigma }^2_k&= \frac{\sum _{d=L_{\mathrm{w}}+1}^D\lambda _{dk}}{D-L_{\mathrm{w}}}, \end{aligned}$$
(48)

where \(\mathbf{U}_k\) denotes the \(D\times L_{\mathrm{w}}\) matrix whose column vectors are the first eigenvectors of \(\mathbf{C}_k\) and \(\varvec{\Lambda }_k\) is a \(L_{\mathrm{w}}\times L_{\mathrm{w}}\) diagonal matrix containing the corresponding first eigenvalues.
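
A minimal NumPy sketch of this closed-form M-residual-step, for the isotropic case, is given below; it assumes the weighted residuals \({\varvec{u}}_{kn}\) have already been stacked as the columns of a \(D\times N\) matrix, and the names are ours.

```python
import numpy as np

def m_residual_step_isotropic(U_res, L_w):
    """PPCA-like update of (A_k^w, sigma_k^2) from residuals, Eqs. (47)-(48).
    U_res: (D, N) matrix whose n-th column is u_kn; L_w: latent dimension."""
    D, N = U_res.shape
    C = U_res @ U_res.T / N                         # sample residual covariance C_k
    eigval, eigvec = np.linalg.eigh(C)              # returned in ascending order
    eigval, eigvec = eigval[::-1], eigvec[:, ::-1]  # reorder: decreasing eigenvalues
    sigma2 = eigval[L_w:].mean()                    # Eq. (48): mean of discarded eigenvalues
    U = eigvec[:, :L_w]                             # first L_w eigenvectors of C_k
    Lam = np.diag(eigval[:L_w])                     # corresponding eigenvalues
    A_w = U @ np.sqrt(Lam - sigma2 * np.eye(L_w))   # Eq. (47)
    return A_w, sigma2
```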

The hybrid nature of hGLLiM (at the crossroads of regression and dimensionality reduction) is striking in this variant, as it alternates between a mixture-of-Gaussians step, a local-linear-regression step and a local-linear-dimensionality-reduction step on residuals. This variant is also much easier to initialize, since a set of initial posterior values \(\{r_{nk}^{(0)}\}_{n=1,k=1}^{N,K}\) can be obtained using the \(K\)-means algorithm or the standard GMM-EM algorithm on \({\varvec{t}}_{1:N}\) or on the joint data \([{\varvec{y}};{\varvec{t}}]_{1:N}\), as done in (Qiao and Minematsu 2009), before proceeding to the M-step. On the other hand, due to the time-consuming eigenvalue decomposition needed at each iteration, the marginal hGLLiM-EM turns out to be slower than the general hGLLiM-EM algorithm described in Sect. 5. We thus use the marginal algorithm as an initialization procedure for the general hGLLiM-EM algorithm.
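
The initialization mentioned above can be sketched as follows: a self-contained K-means (Lloyd) clustering of the joint data \([{\varvec{y}};{\varvec{t}}]\) yielding hard initial responsibilities. The function and variable names are chosen for this example; in practice any K-means or GMM-EM implementation would serve the same purpose.

```python
import numpy as np

def init_responsibilities(Y, T, K, n_iter=50, seed=0):
    """Hard initial responsibilities r_nk^(0) from K-means on the joint data [y; t].
    Y: (D, N), T: (Lt, N); returns an (N, K) 0/1 assignment matrix."""
    X = np.vstack([Y, T]).T                          # (N, D + Lt) joint samples
    N = X.shape[0]
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(N, size=K, replace=False)].astype(float)
    for _ in range(n_iter):                          # plain Lloyd iterations
        dist = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=-1)
        labels = dist.argmin(axis=1)
        for k in range(K):
            if np.any(labels == k):
                centers[k] = X[labels == k].mean(axis=0)
    r0 = np.zeros((N, K))
    r0[np.arange(N), labels] = 1.0                   # hard assignments
    return r0
```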

About this article

Cite this article

Deleforge, A., Forbes, F. & Horaud, R. High-dimensional regression with Gaussian mixtures and partially-latent response variables. Stat Comput 25, 893–911 (2015). https://doi.org/10.1007/s11222-014-9461-5

Keywords

  • Regression
  • Latent variable
  • Mixture models
  • Expectation-maximization
  • Dimensionality reduction