Abstract
The problem of approximating high-dimensional data with a low-dimensional representation is addressed. The article makes the following contributions. An inverse regression framework is proposed, which exchanges the roles of input and response, such that the low-dimensional variable becomes the regressor, and which is tractable. A mixture of locally-linear probabilistic mappings is introduced: the parameters of the inverse regression are estimated first, and closed-form solutions for the forward parameters of the high-dimensional regression problem of interest are then inferred. Moreover, a partially-latent paradigm is introduced, such that the vector-valued response variable is composed of both observed and latent entries, thus making it possible to deal with data contaminated by experimental artifacts that cannot be explained with noise models. The proposed probabilistic formulation can be viewed as a latent-variable augmentation of regression. Expectation-maximization (EM) procedures are introduced, based on a data-augmentation strategy which facilitates the maximum-likelihood search over the model parameters. Two augmentation schemes are proposed and the associated EM inference procedures are described in detail; they may be viewed as generalizations of a number of EM regression, dimension reduction, and factor analysis algorithms. The proposed framework is validated with both synthetic and real data. Experimental evidence is provided that the method outperforms several existing regression techniques.
Notes
- 1.
- 2.
- 3.
- 4. \(\mathrm{SNR} = 10 \log (\Vert {{\varvec{y}}}\Vert ^2/\Vert {{\varvec{e}}}\Vert ^2)\).
- 5.
- 6. We kept one pixel out of every four, horizontally and vertically.
- 7. \(\text{ NRMSE } =\sqrt{\frac{\sum _{m=1}^M (\hat{t}_m - t_m)^2}{\sum _{m=1}^M (t_m - \overline{t})^2}}\) with \(\overline{t} = M^{-1} \sum _{m=1}^M t_m\) (see the sketch after these notes).
- 8. More details on this evaluation are available in the Supplementary Material, online at https://team.inria.fr/perception/research/high-dim-regression/.
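For concreteness, the two measures defined in notes 4 and 7 can be computed as in the following minimal sketch (NumPy), assuming the signal \({\varvec{y}}\), the noise \({\varvec{e}}\), the predictions \(\hat{t}_m\) and the ground truth \(t_m\) are given as arrays, and that the logarithm in note 4 is base 10 (the usual decibel convention); this is illustrative code, not the authors' implementation.

```python
import numpy as np

def snr_db(y, e):
    """Note 4: SNR = 10 log(||y||^2 / ||e||^2), assuming a base-10 logarithm."""
    return 10.0 * np.log10(np.sum(np.square(y)) / np.sum(np.square(e)))

def nrmse(t_hat, t):
    """Note 7: normalized root mean squared error."""
    t_hat, t = np.asarray(t_hat, float), np.asarray(t, float)
    return np.sqrt(np.sum((t_hat - t) ** 2) / np.sum((t - t.mean()) ** 2))
```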
References
Adragni, K.P., Cook, R.D.: Sufficient dimension reduction and prediction in regression. Philos. Trans. R. Soc. A 367(1906), 4385–4405 (2009)
Agarwal, A., Triggs, B.: Learning to track 3D human motion from silhouettes. In: International conference on machine learning, pp. 9–16. Banff, Canada (2004)
Agarwal, A., Triggs, B.: Recovering 3D human pose from monocular images. IEEE Trans. Pattern Anal. Mach. Intell. 28(1), 44–58 (2006)
Bach, F.R., Jordan, M.I.: A probabilistic interpretation of canonical correlation analysis. Tech. Rep. 688, Department of Statistics, University of California, Berkeley (2005)
Bernard-Michel, C., Douté, S., Fauvel, M., Gardes, L., Girard, S.: Retrieval of Mars surface physical properties from OMEGA hyperspectral images using regularized sliced inverse regression. J. Geophys. Res. 114(E6), (2009)
Bibring, J.P., Soufflot, A., Berthé, M., Langevin, Y., Gondet, B., Drossart, P., Bouyé, M., Combes, M., Puget, P., Semery, A., et al.: OMEGA: Observatoire pour la Minéralogie, l’Eau, les Glaces et l’Activité. Mars Express: the scientific payload 1240, 37–49 (2004)
Bishop, C.M., Svensén, M., Williams, C.K.I.: GTM: the generative topographic mapping. Neural Comput 10(1), 215–234 (1998)
Bouveyron, C., Celeux, G., Girard, S.: Intrinsic dimension estimation by maximum likelihood in isotropic probabilistic PCA. Pattern Recognit. Lett. 32, 1706–1713 (2011)
Cook, R.D.: Fisher lecture: dimension reduction in regression. Stat. Sci. 22(1), 1–26 (2007)
de Veaux, R.D.: Mixtures of linear regressions. Comput. Stat. Data Anal. 8(3), 227–245 (1989)
Deleforge, A., Horaud, R.: 2D sound-source localization on the binaural manifold. In: IEEE workshop on machine learning for signal processing, Santander, Spain, (2012)
Deleforge, A., Forbes, F., Horaud, R.: Acoustic space learning for sound-source separation and localization on binaural manifolds. Int. J. Neural Syst., (2014)
Douté, S., Deforas, E., Schmidt, F., Oliva, R., Schmitt, B.: A comprehensive numerical package for the modeling of Mars hyperspectral images. In: The 38th Lunar and Planetary Science Conference, (Lunar and Planetary Science XXXVIII), (2007)
Fusi, N., Stegle, O., Lawrence, N.: Joint modelling of confounding factors and prominent genetic regulators provides increased accuracy in genetical genomics studies. PLoS Comput. Biol. 8(1), e1002330 (2012)
Gershenfeld, N.: Nonlinear inference and cluster-weighted modeling. Ann. N. Y. Acad. Sci. 808(1), 18–24 (1997)
Ghahramani, Z., Hinton, G.E.: The EM algorithm for mixtures of factor analyzers. Tech. Rep. CRG-TR-96-1, University of Toronto, (1996)
Ingrassia, S., Minotti, S.C., Vittadini, G.: Local statistical modeling via a cluster-weighted approach with elliptical distributions. J. Classif. 29(3), 363–401 (2012)
Jedidi, K., Ramaswamy, V., DeSarbo, W.S., Wedel, M.: On estimating finite mixtures of multivariate regression and simultaneous equation models. Struct. Equ. Model. 3(3), 266–289 (1996)
Kain, A., Macon, M.W.: Spectral voice conversion for text-to-speech synthesis. IEEE International Conference on Acoustics, Speech, and Signal Processing, Seattle, WA, USA 1, 285–288 (1998)
Kalaitzis, A., Lawrence, N.: Residual component analysis: generalising PCA for more flexible inference in linear-Gaussian models. In: International Conference on Machine Learning, Edinburgh, Scotland, UK, (2012)
Lawrence, N.: Probabilistic non-linear principal component analysis with Gaussian process latent variable models. J. Mach. Learn. Res. 6, 1783–1816 (2005)
Li, K.C.: Sliced inverse regression for dimension reduction. J. Am. Stat. Assoc. 86(414), 316–327 (1991)
McLachlan, G.J., Peel, D.: Robust cluster analysis via mixtures of multivariate t-distributions. In: Lecture Notes in Computer Science, pp 658–666. Springer, Berlin (1998)
McLachlan, G.J., Peel, D., Bean, R.: Modelling high-dimensional data by mixtures of factor analyzers. Comput. Stat. Data Anal. 41(3–4), 379–388 (2003)
Meng, X.L., Rubin, D.B.: Maximum likelihood estimation via the ECM algorithm: a general framework. Biometrika 80(2), 267–278 (1993)
Meng, X.L., Van Dyk, D.: The EM algorithm: an old folk-song sung to a fast new tune. J. R. Stat. Soc. B 59(3), 511–567 (1997)
Naik, P., Tsai, C.L.: Partial least squares estimator for single-index models. J. R. Stat. Soc. B 62(4), 763–771 (2000)
Qiao, Y., Minematsu, N.: Mixture of probabilistic linear regressions: a unified view of GMM-based mapping techniques. In: IEEE international conference on acoustics, speech, and signal processing, pp 3913–3916, (2009)
Quandt, R.E., Ramsey, J.B.: Estimating mixtures of normal distributions and switching regressions. J. Am. Stat. Assoc. 73(364), 730–738 (1978)
Rosipal, R., Krämer, N.: Overview and recent advances in partial least squares. In: Saunders C, Grobelnik M, Gunn S, Shawe-Taylor J (eds) Subspace, latent structure and feature selection, lecture notes in computer science, vol 3940, pp 34–51. Springer, Berlin (2006)
Smola, A.J., Schölkopf, B.: A tutorial on support vector regression. Stat. Comput. 14(3), 199–222 (2004)
Talmon, R., Cohen, I., Gannot, S.: Supervised source localization using diffusion kernels. In: Workshop on Applications of Signal Processing to Audio and Acoustics, pp 245–248, (2011)
Thayananthan, A., Navaratnam, R., Stenger, B., Torr, P., Cipolla, R.: Multivariate relevance vector machines for tracking. In: European conference on computer vision, pp 124–138. Springer, Heidelberg (2006)
Tipping, M.: Sparse Bayesian learning and the relevance vector machine. J. Mach. Learn. Res. 1, 211–244 (2001)
Tipping, M.E., Bishop, C.M.: Mixtures of probabilistic principal component analyzers. Neural Comput. 11(2), 443–482 (1999a)
Tipping, M.E., Bishop, C.M.: Probabilistic principal component analysis. J. R. Stat. Soc. B 61(3), 611–622 (1999b)
Vapnik, V., Golowich, S., Smola, A.: Support vector method for function approximation, regression estimation, and signal processing. In: Mozer, M., Jordan, M.I., Petsche, T. (eds.) Advances in neural information processing, pp. 281–287. MIT Press, Cambridge (1997)
Wang, C., Neal, R.M.: Gaussian process regression with heteroscedastic or non-Gaussian residuals. Computing Research Repository (2012)
Wedel, M., Kamakura, W.A.: Factor analysis with (mixed) observed and latent variables in the exponential family. Psychometrika 66(4), 515–530 (2001)
Wu, H.: Kernel sliced inverse regression with applications to classification. J. Comput. Graph. Stat. 17(3), 590–610 (2008)
Xu, L., Jordan, M.I., Hinton, G.E.: An alternative model for mixtures of experts. In: Tesauro, G., Touretzky, D.S., Leen, T.K. (eds.) Advances in neural information processing systems, pp. 633–640. MIT Press, Cambridge (1995)
Zhao, J.H., Yu, P.L.: Fast ML estimation for the mixture of factor analyzers via an ECM algorithm. IEEE Trans. Neural Netw. 19(11), 1956–1961 (2008)
Acknowledgments
The authors wish to thank the anonymous reviewers for their constructive remarks and suggestions, which helped organize and improve this article.
Appendices
Appendix 1: Link between joint GMM and GLLiM
Proposition 1
A GLLiM model on \({\varvec{X}},{\varvec{Y}}\) with unconstrained parameters \(\varvec{\theta }= \{{\varvec{c}}_k,\varvec{\Gamma }_k,\pi _k,\mathbf{A}_k,\varvec{b}_k,\varvec{\Sigma }_k\}_{k=1}^K\) is equivalent to a Gaussian mixture model on the joint variable \([{\varvec{X}};{\varvec{Y}}]\) with unconstrained parameters \(\varvec{\psi }=\{{\varvec{m}}_k,\mathbf{V}_k,\rho _k\}_{k=1}^K\), i.e.,
The parameter vector \(\varvec{\theta }\) can be expressed as a function of \(\varvec{\psi }\) by:
The parameter \(\varvec{\psi }\) can be expressed as a function of \(\varvec{\theta }\) by:
Note that this proposition was proved for \(D=1\) by Ingrassia et al. (2012), but not in the general case considered here.
Proof
Equation (43) is obtained using (44) and the formulas for conditional multivariate Gaussian variables. Equation (44) is obtained from standard algebra by identifying the joint distribution \(p({\varvec{X}},{\varvec{Y}}|{\varvec{Z}};\varvec{\theta })\) defined by (3) and (4) with a multivariate Gaussian distribution. To complete the proof, one needs to prove the following two statements:
- (i) For any \(\rho _k\in \mathbb {R}, {\varvec{m}}_k\in \mathbb {R}^{D+L}\) and \(\mathbf{V}_k\in \mathcal {S}^{L+D}_+\), there is a set of parameters \({\varvec{c}}_k\in \mathbb {R}^L,\varvec{\Gamma }_k\in \mathcal {S}^{L}_+,\pi _k\in \mathbb {R},\mathbf{A}_k\in \mathbb {R}^{D\times L},\varvec{b}_k\in \mathbb {R}^D\) and \(\varvec{\Sigma }_k\in \mathcal {S}^{D}_+\) such that (43) holds.
- (ii) Reciprocally, for any \({\varvec{c}}_k\in \mathbb {R}^L,\varvec{\Gamma }_k\in \mathcal {S}^{L}_+,\pi _k\in \mathbb {R},\mathbf{A}_k\in \mathbb {R}^{D\times L},\varvec{b}_k\in \mathbb {R}^D,\varvec{\Sigma }_k\in \mathcal {S}^{D}_+\), there is a set of parameters \(\rho _k\in \mathbb {R}, {\varvec{m}}_k\in \mathbb {R}^{L+D}\) and \(\mathbf{V}_k\in \mathcal {S}^{D+L}_+\) such that (44) holds,
where \(\mathcal {S}^{M}_+\) denotes the set of \(M\times M\) symmetric positive definite matrices. We introduce the following lemma.
Lemma 1
If
then \(\varvec{\Sigma }=\mathbf{V}^{\mathrm{yy}} - \mathbf{V}^{\mathrm{xy}\top }\mathbf{V}^{\mathrm{xx}-1}\mathbf{V}^{\mathrm{xy}}\in \mathcal {S}^{D}_+\).
Proof
Since \(\mathbf{V}\in \mathcal {S}^{L+D}_+\) we have \({\varvec{u}}^{\top }\mathbf{V}{\varvec{u}}>0\) for all non null \({\varvec{u}}\in \mathbb {R}^{L+D*}\). Using the decomposition \({\varvec{u}}=[{\varvec{u}}^{\mathrm{x}};{\varvec{u}}^{\mathrm{y}}]\) we obtain
In particular, for \({\varvec{u}}^{\mathrm{x}}=-\mathbf{V}^{\mathrm{xx}-1}\mathbf{V}^{\mathrm{xy}}{\varvec{u}}^{\mathrm{y}}\) we obtain
and hence \(\varvec{\Sigma }\in \mathcal {S}^{D}_+\). \(\square \)
Lemma 2
If \(\mathbf{A}\in \mathbb {R}^{D\times L}, \varvec{\Gamma }\in \mathcal {S}^{L}_+, \varvec{\Sigma }\in \mathcal {S}^{D}_+\), then
Proof
Since \(\varvec{\Gamma }\in \mathcal {S}^{L}_+\) there is a unique symmetric positive definite matrix \(\varvec{\Lambda }\in \mathcal {S}^{L}_+\) such that \(\varvec{\Gamma }=\varvec{\Lambda }^2\). Using standard algebra, we obtain that for all non null \({\varvec{u}}=[{\varvec{u}}^{\mathrm{x}};{\varvec{u}}^{\mathrm{y}}]\in \mathbb {R}^{L+D*}\),
where \(||.||\) denotes the standard Euclidean norm. The first term of the sum is non-negative for all \([{\varvec{u}}^{\mathrm{x}};{\varvec{u}}^{\mathrm{y}}]\in \mathbb {R}^{L+D*}\), and the second term is strictly positive for all non null \({\varvec{u}}^{\mathrm{y}}\in \mathbb {R}^{D*}\) because \(\varvec{\Sigma }\in \mathcal {S}^{D}_+\) by hypothesis. When \({\varvec{u}}^{\mathrm{y}}=\varvec{0}\), the vector \({\varvec{u}}^{\mathrm{x}}\) is non null and the first term is strictly positive, since \(\varvec{\Lambda }\) is positive definite. Therefore, \(\mathbf{V}\in \mathcal {S}^{L+D}_+\). \(\square \)
Lemma 1 and the correspondence formulae (43) prove (i); Lemma 2 and the correspondence formulae (44) prove (ii), hence completing the proof.\(\square \)
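To make the correspondence of Proposition 1 concrete, the sketch below converts the unconstrained GLLiM parameters of one mixture component into the joint-GMM parameterization and back, using the standard Gaussian conditioning identities that underlie (43) and (44); the joint variable is ordered as \([{\varvec{X}};{\varvec{Y}}]\), and the code and function names are illustrative rather than the authors' implementation.

```python
import numpy as np

def gllim_to_joint(c, Gamma, A, b, Sigma):
    """Joint-GMM parameters (m, V) of [X; Y] for one component,
    given X ~ N(c, Gamma) and Y | X = x ~ N(Ax + b, Sigma)."""
    m = np.concatenate([c, A @ c + b])
    V = np.block([[Gamma,     Gamma @ A.T],
                  [A @ Gamma, Sigma + A @ Gamma @ A.T]])
    return m, V

def joint_to_gllim(m, V, L):
    """Recover (c, Gamma, A, b, Sigma) from (m, V) by conditioning the
    joint Gaussian on its first L coordinates (the X block)."""
    c, m_y = m[:L], m[L:]
    Vxx, Vxy, Vyy = V[:L, :L], V[:L, L:], V[L:, L:]
    A = Vxy.T @ np.linalg.inv(Vxx)
    b = m_y - A @ c
    Sigma = Vyy - A @ Vxy                      # Schur complement of Vxx
    return c, Vxx, A, b, Sigma
```

The returned \(\varvec{\Sigma }_k\) is the Schur complement appearing in Lemma 1, which guarantees that it is symmetric positive definite whenever \(\mathbf{V}_k\) is.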
Appendix 2: The marginal hybrid GLLiM-EM
By marginalizing out the hidden variables \({\varvec{W}}_{1:N}\), we obtain a different EM algorithm from the one presented in Sect. 5, with hidden variables \({\varvec{Z}}_{1:N}\) only. For a clearer connection with standard procedures we assume here, as already specified, that \({\varvec{c}}_k^{\mathrm{w}} = \varvec{0}_{L_{\mathrm{w}}}\) and \(\varvec{\Gamma }_k^{\mathrm{w}}= \mathbf{I}_{L_{\mathrm{w}}}\). The E–W-step disappears, while the E–Z-step and the subsequent updates of \(\pi _k\), \({\varvec{c}}_k^{\mathrm{t}}\) and \(\varvec{\Gamma }_k^{\mathrm{t}}\) in the M-GMM-step are exactly the same as in Sect. 5.4. However, the marginalization of \({\varvec{W}}_{1:N}\) leads to a clearer separation between the regression parameters \(\mathbf{A}_k^{\mathrm{t}}\) and \(\varvec{b}_k\) (M-regression-step) and the other parameters \(\mathbf{A}_k^{\mathrm{w}}\) and \(\varvec{\Sigma }_k\) (M-residual-step). This can be seen directly from Eq. (19), which shows that after marginalizing \({\varvec{W}}\), the model parameters separate into a standard regression part \({\varvec{A}}_k^{\mathrm{t}} {\varvec{t}}_n + \varvec{b}_k\), for which standard estimators do not involve the noise variance, and a PPCA-like part on the regression residuals \({\varvec{y}}_n- \tilde{{\varvec{A}}}_k^{\mathrm{t}} {\varvec{t}}_n - \tilde{\varvec{b}}_k\), in which the non-standard noise covariance \(\varvec{\Sigma }_k + \mathbf{A}_k^{\mathrm{w}} (\mathbf{A}_k^{\mathrm{w}})^{\top }\) is typically dealt with by adding a latent variable \({\varvec{W}}\).
The algorithm is therefore made of the E–Z-step and the M-GMM-step detailed in Sect. 5.4, and of the following M-steps:
M-regression-step:
The \(\mathbf{A}_k^{\mathrm{t}}\) and \(\varvec{b}_k\) parameters are obtained using standard weighted affine regression from \(\{{\varvec{t}}_n\}_{n=1}^N\) to \(\{{\varvec{y}}_n\}_{n=1}^N\) with weights \(\widetilde{r}_{nk}\), i.e.,
with
and with
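As an illustrative sketch of this weighted affine regression step (not the authors' code), the update for each component \(k\) can be computed from responsibility-weighted means and scatter matrices; below, `Y` stacks the observations \({\varvec{y}}_n\), `T` the responses \({\varvec{t}}_n\), and `R` the responsibilities \(\widetilde{r}_{nk}\).

```python
import numpy as np

def m_regression_step(Y, T, R):
    """Weighted affine regression y ~ A_k t + b_k for each component k.
    Y: (N, D) observations, T: (N, Lt) responses, R: (N, K) responsibilities."""
    N, D = Y.shape
    Lt, K = T.shape[1], R.shape[1]
    A = np.zeros((K, D, Lt))
    b = np.zeros((K, D))
    for k in range(K):
        r = R[:, k]
        rk = r.sum()
        t_bar = (r[:, None] * T).sum(axis=0) / rk      # weighted mean of t
        y_bar = (r[:, None] * Y).sum(axis=0) / rk      # weighted mean of y
        Tc, Yc = T - t_bar, Y - y_bar                  # centred data
        Stt = (r[:, None] * Tc).T @ Tc                 # weighted scatter of t
        Syt = (r[:, None] * Yc).T @ Tc                 # weighted cross-scatter
        A[k] = Syt @ np.linalg.inv(Stt)
        b[k] = y_bar - A[k] @ t_bar
    return A, b
```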
M-residual-step:
Optimal values for \(\mathbf{A}^{\mathrm{w}}_k\) and \(\varvec{\Sigma }_k\) are obtained by minimization of the following criterion:
where \({\varvec{u}}_{kn}=\sqrt{\widetilde{r}_{nk}/\widetilde{r}_k}({\varvec{y}}_n-\widetilde{\mathbf{A}}^{\mathrm{t}}_k{\varvec{t}}_n-\widetilde{\varvec{b}}_k)\). The vectors \(\{{\varvec{u}}_{kn}\}_{n=1}^N\) can be seen as the residuals of the \(k\)-th local affine transformation. No closed-form solution exists in the general case. A first option is to use an inner loop, such as a gradient descent technique, or to consider \(Q_k\) as the new target observed-data likelihood and use an inner EM corresponding to the general EM described in the previous section with \(L_{\mathrm{t}}=0\) and \(K=1\). Another option is the Expectation Conditional Maximization (ECM) algorithm (Meng and Rubin 1993), as proposed by Zhao and Yu (2008). The ECM algorithm replaces the M-step of the EM algorithm with a sequence of conditional maximization (CM) steps. In the general case, these CM steps lead to an update of \(\mathbf{A}_k^{\mathrm{w}}\), conditional on \(\varvec{\Sigma }_k\), which is similar to PPCA (Tipping and Bishop 1999b) with an isotropic noise variance fixed to 1. This yields very convenient closed-form expressions (Zhao and Yu 2008), as detailed below. Zhao and Yu (2008) show that such an ECM algorithm is computationally more efficient than EM when the sample size is large relative to the data dimension, and that the reverse may hold in other situations.
However, in the particular case \(\varvec{\Sigma }_k=\sigma ^2_k\mathbf{I}_D\), we can afford a standard EM, as the problem connects to PPCA. Indeed, \(Q_k\) then has exactly the same form as the observed-data log-likelihood in PPCA, with parameters \((\sigma ^2_k,\mathbf{A}_k^{\mathrm{w}})\) and observations \(\{{\varvec{u}}_{kn}\}_{n=1}^N\). Denoting by \(\mathbf{C}_k= \sum _{n=1}^N {\varvec{u}}_{kn} {\varvec{u}}_{kn}^T/N\) the \(D\times D\) sample residual covariance matrix and by \(\lambda _{1k}>\dots >\lambda _{Dk}\) its eigenvalues in decreasing order, we can therefore use the key result of Tipping and Bishop (1999b) to see that a global maximum of \(Q_k\) is obtained for
where \(\mathbf{U}_k\) denotes the \(D\times L_{\mathrm{w}}\) matrix whose column vectors are the \(L_{\mathrm{w}}\) leading eigenvectors of \(\mathbf{C}_k\) and \(\varvec{\Lambda }_k\) is the \(L_{\mathrm{w}}\times L_{\mathrm{w}}\) diagonal matrix containing the corresponding leading eigenvalues.
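As an illustration of this closed-form M-residual-step, here is a minimal sketch assuming the standard PPCA maximum-likelihood solution of Tipping and Bishop (1999b): \(\sigma ^2_k\) is the average of the \(D-L_{\mathrm{w}}\) smallest eigenvalues of \(\mathbf{C}_k\), and \(\mathbf{A}_k^{\mathrm{w}}\) is built from the \(L_{\mathrm{w}}\) leading eigenpairs; the code is illustrative, not the authors' implementation.

```python
import numpy as np

def m_residual_step_isotropic(U_res, Lw):
    """PPCA-like closed form for (A_w, sigma2) of one component.
    U_res: (N, D) matrix whose rows are the weighted residuals u_kn."""
    N, D = U_res.shape
    C = U_res.T @ U_res / N                          # D x D sample residual covariance
    eigval, eigvec = np.linalg.eigh(C)               # ascending eigenvalues
    eigval, eigvec = eigval[::-1], eigvec[:, ::-1]   # reorder in decreasing order
    sigma2 = eigval[Lw:].mean()                      # average of the D - Lw smallest
    # A_w = U_k (Lambda_k - sigma2 I)^{1/2}, built from the Lw leading eigenpairs
    A_w = eigvec[:, :Lw] * np.sqrt(np.maximum(eigval[:Lw] - sigma2, 0.0))
    return A_w, sigma2
```

As in PPCA, \(\mathbf{A}_k^{\mathrm{w}}\) is only identified up to a right-multiplication by an arbitrary orthogonal matrix.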
The hybrid nature of hGLLiM (at the crossroads of regression and dimensionality reduction) is striking in this variant, as it alternates between a mixture-of-Gaussians step, a local-linear-regression step and a local-linear-dimensionality-reduction step on residuals. This variant is also much easier to initialize, as a set of initial posterior values \(\{r_{nk}^{(0)}\}_{n=1,k=1}^{N,K}\) can be obtained using the \(K\)-means algorithm or the standard GMM-EM algorithm on \({\varvec{t}}_{1:N}\) or on the joint data \([{\varvec{y}};{\varvec{t}}]_{1:N}\), as done in (Qiao and Minematsu 2009), before proceeding to the M-step. On the other hand, due to the time-consuming eigenvalue decomposition needed at each iteration, the marginal hGLLiM-EM turns out to be slower than the general hGLLiM-EM algorithm described in Sect. 5. We thus use the marginal algorithm as an initialization procedure for the general hGLLiM-EM algorithm.
Cite this article
Deleforge, A., Forbes, F. & Horaud, R. High-dimensional regression with Gaussian mixtures and partially-latent response variables. Stat Comput 25, 893–911 (2015). https://doi.org/10.1007/s11222-014-9461-5
Keywords
- Regression
- Latent variable
- Mixture models
- Expectation-maximization
- Dimensionality reduction