Model-based regression clustering for high-dimensional data: application to functional data

Abstract

Finite mixture regression models are useful for modeling the relationship between response and predictors arising from different subpopulations. In this article, we study high-dimensional predictors and a high-dimensional response, and we propose two procedures to cluster observations according to the link between predictors and the response. To reduce the dimension, we propose to use the Lasso estimator, which accounts for sparsity, and a maximum likelihood estimator penalized by the rank, which accounts for the matrix structure. To choose the number of components and the sparsity level, we construct a collection of models varying those two parameters, and we select a model from this collection with a non-asymptotic criterion. We extend these procedures to functional data, where predictors and responses are functions. For this purpose, we use a wavelet-based approach. For each situation, we provide algorithms, and we apply and evaluate our methods on both simulated and real datasets to understand how they work in practice.

References

  • Anderson TW (1951) Estimating linear restrictions on regression coefficients for multivariate normal distributions. Ann Math Stat 22(3):327–351

  • Baudry J-P, Maugis C, Michel B (2012) Slope heuristics: overview and implementation. Stat Comput 22(2):455–470

  • Birgé L, Massart P (2007) Minimal penalties for Gaussian model selection. Probab Theory Relat Fields 138(1–2):33–73

  • Bühlmann P, van de Geer S (2011) Statistics for high-dimensional data: methods, theory and applications. Springer Series in Statistics. Springer, Berlin

  • Bunea F, She Y, Wegkamp M (2012) Joint variable and rank selection for parsimonious estimation of high-dimensional matrices. Ann Stat 40(5):2359–2388

  • Celeux G, Govaert G (1995) Gaussian parsimonious clustering models. Pattern Recognit 28(5):781–793

  • Ciarleglio A, Ogden T (2014) Wavelet-based scalar-on-function finite mixture regression models. Comput Stat Data Anal 93:86–96

  • Dempster AP, Laird NM, Rubin DB (1977) Maximum likelihood from incomplete data via the EM algorithm (with discussion). J R Stat Soc Ser B 39:1–38

  • Devijver E (2015) Finite mixture regression: a sparse variable selection by model selection for clustering. Electron J Stat 9(2):2642–2674

  • Devijver E (2015) Joint rank and variable selection for parsimonious estimation in high-dimension finite mixture regression model. arXiv:1501.00442

  • Ferraty F, Vieu P (2006) Nonparametric functional data analysis: theory and practice. Springer Series in Statistics. Springer, New York

  • James G, Sugar C (2003) Clustering for sparsely sampled functional data. J Am Stat Assoc 98(462):397–408

  • Giraud C (2011) Low rank multivariate regression. Electron J Stat 5:775–799

  • Hubert L, Arabie P (1985) Comparing partitions. J Classif 2(1):193–218

  • Izenman A (1975) Reduced-rank regression for the multivariate linear model. J Multivar Anal 5(2):248–264

  • Jones PN, McLachlan GJ (1992) Fitting finite mixture models in a regression context. Aust J Stat 34:233–240

  • Mallat S (1989) A theory for multiresolution signal decomposition: the wavelet representation. IEEE Trans Pattern Anal Mach Intell 11:674–693

  • Mallat S (1999) A wavelet tour of signal processing. Academic Press, Dublin

  • McLachlan G, Peel D (2004) Finite mixture models. Wiley Series in Probability and Statistics. Wiley, New York

  • Meinshausen N, Bühlmann P (2010) Stability selection. J R Stat Soc Ser B Stat Methodol 72(4):417–473

  • Meynet C, Maugis-Rabusseau C (2012) A sparse variable selection procedure in model-based clustering. Research report

  • Misiti M, Misiti Y, Oppenheim G, Poggi J-M (2004) Matlab Wavelet Toolbox user's guide, version 3. The MathWorks Inc, Natick

  • Misiti M, Misiti Y, Oppenheim G, Poggi J-M (2007) Clustering signals using wavelets. In: Sandoval F, Prieto A, Cabestany J, Graña M (eds) Computational and ambient intelligence. Lecture Notes in Computer Science, vol 4507. Springer, Berlin Heidelberg, pp 514–521

  • Park T, Casella G (2008) The Bayesian lasso. J Am Stat Assoc 103(482):681–686

  • Ramsay JO, Silverman BW (2005) Functional data analysis. Springer Series in Statistics. Springer, New York

  • Simon N, Friedman J, Hastie T, Tibshirani R (2013) A sparse-group lasso. J Comput Graph Stat 22:231–245

  • Städler N, Bühlmann P, van de Geer S (2010) \(\ell _{1}\)-penalization for mixture regression models. Test 19(2):209–256

  • Tibshirani R (1996) Regression shrinkage and selection via the lasso. J R Stat Soc Ser B 58(1):267–288

  • Tseng P (2001) Convergence of a block coordinate descent method for nondifferentiable minimization. J Optim Theory Appl 109:475–494

  • Wu CFJ (1983) On the convergence properties of the EM algorithm. Ann Stat 11(1):95–103

  • Yao F, Fu Y, Lee T (2011) Functional mixture regression. Biostatistics 12(2):341–353

  • Zhao Y, Ogden T, Reiss P (2012) Wavelet-based LASSO in functional linear regression. J Comput Graph Stat 21(3):600–617

Acknowledgments

I am indebted to Jean-Michel Poggi and Pascal Massart for suggesting that I study this problem, and for stimulating discussions. I am also grateful to Jean-Michel Poggi for carefully reading the manuscript and making many useful suggestions. I thank Yves Misiti and Benjamin Auder for their help in speeding up the code. I also thank the referees for their very interesting improvements and suggestions, and the editors for their help in writing this paper. I am also grateful to Irène Gijbels for carefully proofreading the manuscript.

Author information

Corresponding author

Correspondence to Emilie Devijver.

Appendix

In this appendix, we first derive the updating formulae of the EM algorithms in “EM algorithms”. In “Group-Lasso MLE and Group-Lasso Rank procedures”, we extend our procedures by using the Group-Lasso estimator, rather than the Lasso estimator, to select relevant indices.

1.1 EM algorithms

1.1.1 EM algorithm for the Lasso estimator

Introduced by Dempster et al. (1977), the EM (Expectation-Maximization) algorithm is used to compute maximum likelihood estimators, penalized or not.

The expected complete negative log-likelihood is denoted by

$$\begin{aligned} Q(\theta |\theta ')=-\frac{1}{n} E_{\theta '}(l_{c}(\theta ,\varvec{Y},\varvec{x},\varvec{Z})|\varvec{Y}) \end{aligned}$$

in which

$$\begin{aligned} l_{c} (\theta ,\varvec{Y} ,\varvec{x}, \varvec{Z}) =&\sum _{i=1}^{n} \sum _{k=1}^{K} [\varvec{Z}_i]_k \log \left( \frac{\det (\varvec{P}_k)}{(2 \pi )^{q/2}} \exp \left( - \frac{1}{2} (\varvec{P}_k \varvec{Y}_i - \varvec{x}_i\varvec{\Phi }_k)^t (\varvec{P}_k \varvec{Y}_i - \varvec{x}_i\varvec{\Phi }_k) \right) \right) \\&+ [\varvec{Z}_i]_k \log (\pi _{k}) ; \end{aligned}$$

where the \(\varvec{Z}_i\) are unobserved independent and identically distributed random variables following a multinomial distribution with parameters 1 and \((\pi _1,\ldots ,\pi _K)\), describing the component membership of the \(i\)th observation in the finite mixture regression model.

The expected complete penalized negative log-likelihood is

$$\begin{aligned} Q_{\text {pen}}(\theta |\theta ')=Q(\theta |\theta ')+ \lambda \sum _{k=1}^{K} \pi _{k} ||\varvec{\Phi }_k||_1. \end{aligned}$$

Calculus for updating formulae

  • E-step: compute \(Q(\theta |\theta ^{(\text {ite})})\), or, equivalently, compute for \(k \in \{1,\ldots ,K\}\), \(i \in \{1,\ldots ,n\}\),

    $$\begin{aligned} \varvec{\tau }^{(\text {ite})}_{i,k}&=E_{\theta ^{(\text {ite})}}([\varvec{Z}_i]_k|\varvec{Y}_i)\\&=\frac{\pi _{k}^{(\text {ite})} \det \varvec{P}_k^{(\text {ite})} \exp \left( -\frac{1}{2}\left( \varvec{P}_k^{(\text {ite})} \varvec{y}_i-\varvec{x}_i \varvec{\Phi }_k^{(\text {ite})}\right) ^t\left( \varvec{P}_k^{(\text {ite})} \varvec{y}_i-\varvec{x}_i \varvec{\Phi }_k^{(\text {ite})}\right) \right) }{\sum _{r=1}^{K}\pi _{r}^{(\text {ite})}\det \varvec{P}_r^{(\text {ite})} \exp \left( -\frac{1}{2}\left( \varvec{P}_r^{(\text {ite})} \varvec{y}_i-\varvec{x}_i \varvec{\Phi }_r^{(\text {ite})}\right) ^t\left( \varvec{P}_r^{(\text {ite})} \varvec{y}_i-\varvec{x}_i \varvec{\Phi }_r^{(\text {ite})}\right) \right) } \end{aligned}$$

    Thanks to the MAP principle, the clustering is updated.

  • M-step: optimize \(Q_{\text {pen}}(\theta |\theta ^{(\text {ite})})\) with respect to \(\theta \). For this, rewrite the Karush–Kuhn–Tucker conditions. We have

    $$\begin{aligned}&Q_\text {pen}(\theta |\theta ^{(\text {ite})}) \nonumber \\ =&- \frac{1}{n} \sum _{i=1}^{n} \sum _{k=1}^{K} E_{\theta ^{(\text {ite})}} \left( [\varvec{Z}_i]_k \log \left( \frac{\det (\varvec{P}_k)}{(2 \pi )^{q/2}} \exp \left( -\frac{1}{2} (\varvec{P}_k \varvec{y}_i - \varvec{x}_i \varvec{\Phi }_k)^t(\varvec{P}_k \varvec{y}_i - \varvec{x}_i \varvec{\Phi }_k) \right) \right) \left| \varvec{Y} \right. \right) \nonumber \\&- \frac{1}{n} \sum _{i=1}^{n} \sum _{k=1}^{K} E_{\theta ^{(\text {ite})}} \left( [\varvec{Z}_i]_k \log \pi _{k} |\varvec{Y} \right) + \lambda \sum _{k=1}^{K} \pi _{k} ||\varvec{\Phi }_k||_1 \nonumber \\ =&- \frac{1}{n} \sum _{i=1}^{n} \sum _{k=1}^{K} -\frac{1}{2} (\varvec{P}_k \varvec{y}_i - \varvec{x}_i \varvec{\Phi }_k)^t (\varvec{P}_k \varvec{y}_i - \varvec{x}_i \varvec{\Phi }_k)E_{\theta ^{(\text {ite})}} \left( [\varvec{Z}_i]_k |\varvec{Y} \right) \nonumber \\&- \frac{1}{n} \sum _{i=1}^{n} \sum _{k=1}^{K} \sum _{m=1}^q \log \left( \frac{[\varvec{P}_k]_{m,m}}{\sqrt{2 \pi }} \right) E_{\theta ^{(\text {ite})}} \left( [\varvec{Z}_i]_k |\varvec{Y} \right) \nonumber \\&- \frac{1}{n} \sum _{i=1}^{n} \sum _{k=1}^{K} E_{\theta ^{(\text {ite})}} \left( [\varvec{Z}_i]_k |\varvec{Y} \right) \log \pi _{k} + \lambda \sum _{k=1}^{K} \pi _{k} ||\varvec{\Phi }_k||_1. \end{aligned}$$
    (14)

    Firstly, we optimize this formula with respect to \(\pi \): this is equivalent to optimizing

    $$\begin{aligned} -\frac{1}{n} \sum _{i=1}^{n} \sum _{k=1}^{K} \varvec{\tau }_{i,k}^{(\text {ite})} \log (\pi _{k}) + \lambda \sum _{k=1}^{K} \pi _{k} ||\varvec{\Phi }_k||_1. \end{aligned}$$

    We obtain

    $$\begin{aligned} \pi _{k}^{(\text {ite}+1)}&= \pi _{k}^{(\text {ite})}+t^{(\text {ite})} \left( \frac{\sum _{i=1}^{n} \varvec{\tau }_{i,k}^{(\text {ite})}}{n} - \pi _{k}^{(\text {ite})}\right) ; \end{aligned}$$
    (15)

    where \(t^{(\text {ite})} \in (0,1]\) is the largest value in \(\{0.1^{l}, l\in \mathbb {N} \}\) such that the update of \(\pi \) defined in (15) improves the expected complete penalized negative log-likelihood. To optimize (14) with respect to \((\varvec{\Phi },\mathbf {P})\), we rewrite the expression: it amounts to optimizing

    $$\begin{aligned}&-\frac{1}{n} \sum _{i=1}^n \left( \varvec{\tau }_{i,k}^{(\text {ite})} \sum _{m=1}^q \log ([\varvec{P}_k]_{m,m}) - \frac{1}{2} (\varvec{P}_k \varvec{\widetilde{y}}^{(\text {ite})}_i - \varvec{\widetilde{x}}^{(\text {ite})}_i \varvec{\Phi }_k)^t(\varvec{P}_k \varvec{\widetilde{y}}^{(\text {ite})}_i - \varvec{\widetilde{x}}^{(\text {ite})}_i \varvec{\Phi }_k) \right) \\&\quad + \lambda \pi _{k} ||\varvec{\Phi }_k||_1 \end{aligned}$$

    for all \(k \in \{1,\ldots ,K\}\), with for all \(j \in \{1, \ldots , p\}\) for all \(i\in \{1,\ldots ,n\}\), for all \(k \in \{1,\ldots ,K\}\),

    $$\begin{aligned}&([\varvec{\widetilde{y}}^{(\text {ite})}_{i}]_{k,.},[\varvec{\widetilde{x}}^{(\text {ite})}_i]_{k,.}) = \sqrt{\varvec{\tau }_{i,k}^{(\text {ite})}} (\varvec{y}_i,\varvec{x}_i). \end{aligned}$$

    Remark that this is equivalent to optimizing

    $$\begin{aligned}&-\frac{1}{n} n_k^{(\text {ite})} \sum _{m=1}^q \log ( [\varvec{P}_k]_{m,m}) + \frac{1}{2n} \sum _{i=1}^n \sum _{m=1}^q \left( [\varvec{P}_k]_{m,m} [\varvec{\widetilde{y}}^{(\text {ite})}_i]_{k,m} - [\varvec{\Phi }_k]_{m,.} [\varvec{\widetilde{x}}^{(\text {ite})}_i]_{k,.} \right) ^2\nonumber \\&+ \lambda \pi _{k} ||\varvec{\Phi }_k||_1 ; \end{aligned}$$
    (16)

    with \(n_k^{(\text {ite})} = \sum _{i=1}^n \varvec{\tau }^{(\text {ite})}_{i,k}\). With respect to \([\varvec{P}_k]_{m,m}\), the optimization of (16) is the solution of

    $$\begin{aligned} - \frac{n_k^{(\text {ite})}}{n} \frac{1}{[\varvec{P}_k]_{m,m}} + \frac{1}{2n} \sum _{i=1}^n 2 [\varvec{\widetilde{y}}^{(\text {ite})}_i]_{k,m} \left( [\varvec{P}_k]_{m,m} [\varvec{\widetilde{y}}^{(\text {ite})}_i]_{k,m} - [\varvec{\Phi }_k]_{m,.} [\varvec{\widetilde{x}}^{(\text {ite})}_i]_{k,.}\right) =0 \end{aligned}$$

    for all \(k \in \{1,\ldots ,K\}\), for all \(m \in \{1,\ldots ,q\}\), which is equivalent to

    $$\begin{aligned} -1+\frac{1}{n_k^{(\text {ite})}} [\varvec{P}_k]_{m,m}^2 \sum _{i=1}^n [\varvec{\widetilde{y}}^{(\text {ite})}_i]_{k,m}^2 - \frac{1}{n_k^{(\text {ite})}} [\varvec{P}_k]_{m,m} \sum _{i=1}^n [\varvec{\widetilde{y}}^{(\text {ite})}_{i}]_{k,m} [\varvec{\Phi }_k]_{m,.} [\varvec{\widetilde{x}}^{(\text {ite})}_i]_{k,.}&=0\\ \Leftrightarrow -1 + [\varvec{P}_k]_{m,m}^2 \frac{1}{n_k^{(\text {ite})}} || [{\varvec{\widetilde{y}}^{(\text {ite})}}]_{k,m} ||_2^2 - [\varvec{P}_k]_{m,m}\frac{1}{n_k^{(\text {ite})}} \langle [\widetilde{\varvec{y}}^{(\text {ite})}]_{k,m}, [\varvec{\Phi }_k]_{m,.} [\widetilde{\varvec{x}}^{(\text {ite})}]_{k,.} \rangle&=0. \end{aligned}$$

    The discriminant is

    $$\begin{aligned} \Delta _{k,m} = \left( -\frac{1}{n_k^{(\text {ite})}} \langle [\widetilde{\varvec{y}}^{(\text {ite})}]_{k,m},[\varvec{\Phi }_k]_{m,.} [\widetilde{\varvec{x}}^{(\text {ite})}]_{k,.} \rangle \right) ^2 + \frac{4}{n_k^{(\text {ite})}} ||[{\varvec{\widetilde{y}}^{(\text {ite})}}]_{k,m}||_2^2. \end{aligned}$$

    Then, for all \(k \in \{1,\ldots ,K\}\), for all \(m \in \{1,\ldots ,q\}\),

    $$\begin{aligned} \, [\varvec{P}_k]_{m,m}= \frac{\langle [\widetilde{\varvec{y}}^{(\text {ite})}]_{k,m}, [\varvec{\Phi }_k]_{m,.} [\widetilde{\varvec{x}}^{(\text {ite})}]_{k,.} \rangle + n_k^{(\text {ite})}\sqrt{\Delta _{k,m}}}{2 ||[{\varvec{\widetilde{y}}^{(\text {ite})}}]_{k,m}||_2^2}. \end{aligned}$$

    We also look at Eq. (14) as a function of the variable \(\varvec{\Phi }\): setting the partial derivative with respect to \([\varvec{\Phi }_k]_{m,j}\) to zero, we obtain for all \(m \in \{1,\ldots ,q\}\), for all \(k \in \{1,\ldots ,K\}\), for all \(j \in \{1,\ldots ,p\}\),

    $$\begin{aligned}&\sum _{i=1}^n [\varvec{\widetilde{x}}^{(\text {ite})}_i]_{k,j} \left( [\varvec{P}_k]_{m,m} [\varvec{\widetilde{y}}^{(\text {ite})}_i]_{k,m} - \sum _{j_2 =1}^{p} [\varvec{\widetilde{x}}^{(\text {ite})}_i]_{k,j_2} [\varvec{\Phi }_k]_{m,j_2}\right) \\&\quad -n\lambda \pi _{k} \text {sgn}([\varvec{\Phi }_k]_{m,j})=0, \end{aligned}$$

    where \(\text {sgn}\) is the sign function. Then, for all \(k \in \{1,\ldots ,K\}, j \in \{1,\ldots ,p\}, m \in \{1,\ldots ,q\}\),

    $$\begin{aligned} \, [\varvec{\Phi }_k]_{m,j} = \frac{\sum _{i=1}^{n} [\varvec{\widetilde{x}}^{(\text {ite})}_i]_{k,j} \left( [\varvec{P}_k]_{m,m} [\varvec{\widetilde{y}}^{(\text {ite})}_i]_{k,m} - \sum _{\genfrac{}{}{0.0pt}{}{j_2=1}{j_2\ne j}}^{p} [\varvec{\widetilde{x}}^{(\text {ite})}_i]_{k,j_2} [\varvec{\Phi }_k]_{m,j_2}\right) - n\lambda \pi _{k} \text {sgn}([\varvec{\Phi }_k]_{m,j})}{||[{\varvec{\widetilde{x}}^{(\text {ite})}}]_{k,j}||_{2}^2}. \end{aligned}$$

    Let, for all \(k \in \{1,\ldots ,K\}, j \in \{1,\ldots ,p\}, m \in \{1,\ldots ,q\}\),

    $$\begin{aligned} \, [\varvec{S}_{k}]_{j,m}=-\sum _{i=1}^{n} [\varvec{\widetilde{x}}^{(\text {ite})}_i]_{k,j} \left( [\varvec{P}_k]_{m,m} [\varvec{\widetilde{y}}^{(\text {ite})}_i]_{k,m} - \sum _{\genfrac{}{}{0.0pt}{}{j_2=1 }{ j_2\ne j}}^{p} [\varvec{\widetilde{x}}^{(\text {ite})}_i]_{k,j_2} [\varvec{\Phi }_k]_{m,j_2}\right) . \end{aligned}$$

    Then

    $$\begin{aligned}{}[\varvec{\Phi }_k]_{m,j}&= \frac{-[\varvec{S}_{k}]_{j,m}- n\lambda \pi _{k} \text {sgn}([\varvec{\Phi }_k]_{m,j})}{||[{\varvec{\widetilde{x}}^{(\text {ite})}}]_{k,j}||_{2}^2} = \left\{ \begin{array}{lll} &{}\frac{-[\varvec{S}_{k}]_{j,m}+ n\lambda \pi _{k} }{||[{\varvec{\widetilde{x}}^{(\text {ite})}}]_{k,j}||_{2}^2} &{} \quad \text {if } [\varvec{S}_{k}]_{j,m}>n \lambda \pi _{k}\\ &{}-\frac{[\varvec{S}_{k}]_{j,m}+ n\lambda \pi _{k}}{||[{\varvec{\widetilde{x}}^{(\text {ite})}}]_{k,j}||_{2}^2} &{} \quad \text {if } [\varvec{S}_{k}]_{j,m} < -n \lambda \pi _{k}\\ &{}0 &{}\quad \text {elsewhere.} \end{array} \right. \end{aligned}$$

From these equalities, we write the updating formulae; a schematic implementation of one EM iteration is sketched below.
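To make the updating formulae concrete, here is a minimal Python/numpy sketch of one EM iteration. It is an illustration under assumed conventions, not the implementation used in the paper: the name `em_step_lasso`, the array shapes, and the simplified proportion update (which drops the line search of Eq. (15)) are all choices made for this sketch.

```python
import numpy as np

def em_step_lasso(x, y, pi, Phi, P, lam):
    """One E-step + M-step of the penalized EM algorithm (illustrative sketch).

    Shape conventions chosen here: x is (n, p), y is (n, q), pi is (K,),
    Phi is (K, p, q) so that x @ Phi[k] has shape (n, q),
    P is (K, q) holding the diagonal of each P_k; lam is the Lasso parameter.
    """
    n, p = x.shape
    q = y.shape[1]
    K = len(pi)

    # E-step: responsibilities tau_{i,k}, computed on the log scale for stability.
    log_num = np.empty((n, K))
    for k in range(K):
        resid = y * P[k] - x @ Phi[k]                     # P_k y_i - x_i Phi_k
        log_num[:, k] = (np.log(pi[k]) + np.sum(np.log(P[k]))
                         - 0.5 * np.sum(resid ** 2, axis=1))
    log_num -= log_num.max(axis=1, keepdims=True)
    tau = np.exp(log_num)
    tau /= tau.sum(axis=1, keepdims=True)

    # M-step: proportions (plain responsibility average, line search omitted).
    pi_new = tau.mean(axis=0)

    Phi_new, P_new = Phi.copy(), P.copy()
    for k in range(K):
        w = np.sqrt(tau[:, k])[:, None]
        xt, yt = w * x, w * y                             # weighted data \tilde x, \tilde y
        nk = tau[:, k].sum()
        for m in range(q):
            # Update [P_k]_{m,m}: positive root of the quadratic equation above.
            a = np.sum(yt[:, m] ** 2) / nk
            b = np.dot(yt[:, m], xt @ Phi_new[k][:, m]) / nk
            P_new[k, m] = (b + np.sqrt(b ** 2 + 4 * a)) / (2 * a)
            # Coordinate-wise soft-thresholding update of [Phi_k]_{m,j}.
            for j in range(p):
                r = (P_new[k, m] * yt[:, m] - xt @ Phi_new[k][:, m]
                     + xt[:, j] * Phi_new[k][j, m])       # partial residual
                s = np.dot(xt[:, j], r)
                thr = n * lam * pi_new[k]
                Phi_new[k][j, m] = (np.sign(s) * max(abs(s) - thr, 0.0)
                                    / np.sum(xt[:, j] ** 2))
    return pi_new, Phi_new, P_new, tau
```

The order in which \(\varvec{P}_k\) and \(\varvec{\Phi }_k\) are updated inside the loop is one possible choice among several for a coordinate-wise M-step.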

1.1.2 EM algorithm for the rank procedure

To take the matrix structure into account, we perform a dimension reduction on the rank of the regression matrix. If the clustering were known, we could compute the low-rank estimator of a linear model within each cluster.

Indeed, an estimator of fixed rank \(r\) is known in the linear regression case. Denote by \(A^+\) the Moore-Penrose pseudo-inverse of \(A\) and by \([A]_r = U D_r V^t\) the rank-\(r\) truncation of \(A\), where \(U D V^t\) is the singular value decomposition of \(A\) and \(D_r\) is obtained from \(D\) by setting \((D_{r})_{i,i}=0\) for \(i \ge r+1\). If \(\varvec{Y}=\mathbf {B} \varvec{x} + \varvec{\Sigma }\), an estimator of \(\mathbf {B}\) with rank \(r\) is \(\hat{\mathbf {B}}_r = [(\varvec{x}^t \varvec{x})^{+} \varvec{x}^t \varvec{y}]_r\).
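For illustration, this fixed-rank estimator can be computed directly with numpy. This is a sketch; the function name and the convention that observations are stacked in the rows of \(\varvec{x}\) and \(\varvec{y}\) are choices made here.

```python
import numpy as np

def reduced_rank_estimator(x, y, r):
    """Rank-r estimator [ (x^t x)^+ x^t y ]_r of B in the model y = x B + noise,
    with x of shape (n, p) and y of shape (n, q)."""
    b_ols = np.linalg.pinv(x.T @ x) @ x.T @ y          # (x^t x)^+ x^t y
    u, d, vt = np.linalg.svd(b_ols, full_matrices=False)
    d[r:] = 0.0                                        # zero out singular values beyond rank r
    return u @ np.diag(d) @ vt
```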

However, the clustering of the observations is unknown. We therefore use an EM algorithm, whose E-step computes the a posteriori probability of each observation belonging to each cluster.

We suppose in this case that \(\varvec{\Sigma }_k\) and \(\pi _k\) are known, for all \(k \in \{1,\ldots ,K\}\). We use this algorithm to determine \(\varvec{\Phi }_k\), for all \(k \in \{1,\ldots ,K\}\), with ranks fixed to \(\varvec{R} = (R_1, \ldots , R_K)\).

  • E-step: compute for \(k \in \{1,\ldots ,K\}\), \(i \in \{1,\ldots ,n\}\),

    $$\begin{aligned} \varvec{\tau }_{i,k}^{(\text {ite})}&=E_{\theta ^{(\text {ite})}}([\varvec{Z}_i]_k|\varvec{Y}_i)\\&=\frac{\pi _{k}^{(\text {ite})} \det \varvec{P}_k^{(\text {ite})} \exp \left( -\frac{1}{2}\left( \varvec{P}_k^{(\text {ite})} \varvec{y}_i-\varvec{x}_i \varvec{\Phi }_k^{(\text {ite})}\right) ^t\left( \varvec{P}_k^{(\text {ite})} \varvec{y}_i-\varvec{x}_i \varvec{\Phi }_k^{(\text {ite})}\right) \right) }{\sum _{r=1}^{K}\pi _{r}^{(\text {ite})}\det \varvec{P}_r^{(\text {ite})} \exp \left( -\frac{1}{2}\left( \varvec{P}_r^{(\text {ite})} \varvec{y}_i-\varvec{x}_i \varvec{\Phi }_r^{(\text {ite})}\right) ^t \left( \varvec{P}_r^{(\text {ite})} \varvec{y}_i-\varvec{x}_i \varvec{\Phi }_r^{(\text {ite})}\right) \right) }. \end{aligned}$$
  • M-step: assign each observation to its estimated cluster by the MAP principle, using the probabilities computed in the E-step: observation \(i\) is assigned to the component \({{ \mathrm{argmax}}_{k \in \{ 1,\ldots , K\}}}\varvec{\tau }^{(\text {ite})}_{i,k}\). Then, for all \(k \in \{1,\ldots ,K\}\), we define \(\widetilde{\mathbf {B}}_k^{(\text {ite})} = (\varvec{x}^t_{|k} \varvec{x}_{|k})^{-1} \varvec{x}^t_{|k} \varvec{y}_{|k}\), in which \(\varvec{x}_{|k}\) and \(\varvec{y}_{|k}\) correspond to the observations assigned to cluster \(k\). For all \(k \in \{1,\ldots ,K\}\), we compute the singular value decomposition \(\widetilde{\mathbf {B}}_{k}^{(\text {ite})} = U S V^t\) and estimate the regression matrix in cluster \(k\) by \(\hat{\mathbf {B}}_{k}^{(\text {ite})} = U [\varvec{S}]_{R_k} V^t\). A schematic implementation of this step is sketched below.
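The following is a minimal sketch of this rank-constrained M-step, assuming the responsibilities \(\varvec{\tau }^{(\text {ite})}\) have already been computed in the E-step. The names and the use of a pseudo-inverse (as in the fixed-rank estimator above) are choices made for this illustration.

```python
import numpy as np

def m_step_rank(x, y, tau, ranks):
    """Rank-constrained M-step: MAP assignment, then a rank-R_k estimator
    of the regression matrix within each cluster.

    x : (n, p) predictors, y : (n, q) responses,
    tau : (n, K) responsibilities from the E-step, ranks : (R_1, ..., R_K).
    """
    K = tau.shape[1]
    labels = tau.argmax(axis=1)                        # MAP principle
    B_hat = []
    for k in range(K):
        xk, yk = x[labels == k], y[labels == k]
        b_ols = np.linalg.pinv(xk.T @ xk) @ xk.T @ yk  # least squares in cluster k
        u, d, vt = np.linalg.svd(b_ols, full_matrices=False)
        d[ranks[k]:] = 0.0                             # keep the R_k largest singular values
        B_hat.append(u @ np.diag(d) @ vt)
    return B_hat, labels
```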

1.2 Group-Lasso MLE and Group-Lasso Rank procedures

One way to perform these procedures is to use the Group-Lasso estimator rather than the Lasso estimator to select relevant indices. Indeed, this estimator is more natural with respect to the definition of relevant indices. Nevertheless, the results are very similar, because indices are selected in groups whether we use the Lasso estimator or the Group-Lasso estimator. In this section, we describe our procedures with the Group-Lasso estimator.

1.2.1 Context-definitions

Both our procedures take advantage of the Lasso estimator to select relevant indices and to reduce the dimension in case of high-dimensional data. First, recall what is meant by relevant indices.

Definition 2

A couple \((\varvec{y}_m,\varvec{x}_j)\) is said to be irrelevant for the clustering if \([\varvec{\Phi }_{1}]_{m,j} = \ldots = [\varvec{\Phi }_{K}]_{m,j}=0\), which means that the variable \(\varvec{x}_j\) does not explain the variable \(\varvec{y}_m\) for the clustering process. We also say that the indices \((m,j)\) are irrelevant if the couple \((\varvec{y}_m,\varvec{x}_j)\) is irrelevant. A relevant couple is a couple which is not irrelevant: in at least one cluster \(k\), the coefficient \([\varvec{\Phi }_k]_{m,j}\) is not equal to zero. We denote by \(J\) the set of indices \((m,j)\) of relevant couples \((\varvec{y}_m,\varvec{x}_j)\).

According to this definition, we introduce the Group-Lasso estimator.

Definition 3

The Group-Lasso estimator for mixture regression models with regularization parameter \(\lambda \ge 0\) is defined by

$$\begin{aligned} \hat{\theta }^{\text {Group-Lasso}}_\lambda := \underset{\theta \in \Theta _K}{\mathrm{argmin}} \left\{ -\frac{1}{n} \widetilde{l}_{\lambda }(\theta ) \right\} ; \end{aligned}$$

where

$$\begin{aligned} - \frac{1}{n} \widetilde{l}_\lambda (\theta ) = -\frac{1}{n} l(\theta ) + \lambda \sum _{j=1}^{p} \sum _{m=1}^q \sqrt{K}||[\varvec{\Phi }]_{m,j}||_2 ; \end{aligned}$$

where \(||[\varvec{\Phi }]_{m,j}||_2^2 = \sum _{k=1}^K |[\varvec{\Phi }_k]_{m,j}|^2\) and \(\lambda \) is a regularization parameter to be specified.

This Group-Lasso estimator has the advantage of shrinking grouped indices to zero, rather than indices one by one. It is consistent with the definition of relevant indices.
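For intuition, the update induced by this penalty acts on each block \(([\varvec{\Phi }_1]_{m,j},\ldots ,[\varvec{\Phi }_K]_{m,j})\) through a group soft-thresholding operator, sketched below. This is an illustration, not code from the paper; the function name and the way the threshold is parametrized are assumptions.

```python
import numpy as np

def group_soft_threshold(phi_mj, threshold):
    """Shrink the block ([Phi_1]_{m,j}, ..., [Phi_K]_{m,j}) as a whole:
    the couple (m, j) is either kept in every cluster or set to zero in all of them.

    phi_mj : (K,) coefficients of one couple (m, j) across clusters,
    threshold : effective penalty level (e.g. proportional to lambda * sqrt(K)).
    """
    norm = np.linalg.norm(phi_mj)
    if norm <= threshold:
        return np.zeros_like(phi_mj)                  # (m, j) declared irrelevant
    return (1.0 - threshold / norm) * phi_mj          # proportional shrinkage of the block
```

Applied to every couple \((m,j)\), this either keeps or discards the pair \((\varvec{y}_m,\varvec{x}_j)\) simultaneously in all clusters, in line with Definition 2.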

However, depending on the data, it may be of interest to first identify which individual indices are equal to zero. One way is to extend this work with a sparse-group Lasso estimator, described, for example, for the linear model in Simon et al. (2013).

Let us describe two additional procedures, which use the Group-Lasso estimator rather than the Lasso estimator to detect relevant indices.

1.2.2 Group-Lasso-MLE procedure

This procedure is decomposed into three main steps: we construct a collection of models, then in each model we compute the MLE, and finally we select the best model among the collection.

The first step consists in constructing a collection of models \(\{\mathcal {H}_{(K,\widetilde{J})}\}_{(K,\widetilde{J}) \in \mathcal {M}}\) in which \(\mathcal {H}_{(K,\widetilde{J})}\) is defined by

$$\begin{aligned} \mathcal {H}_{(K,\widetilde{J})} = \left\{ h_{\theta }(.|x) \right\} ; \end{aligned}$$
(17)

where

$$\begin{aligned} h_{\theta }(y|x)= \sum _{k=1}^{K} \frac{\pi _{k} \det (\varvec{P}_k)}{(2 \pi )^{q/2}} \exp \left( -\frac{(\varvec{P}_ky-\varvec{\Phi }_k^{[\widetilde{J}]} x )^t(\varvec{P}_ky-\varvec{\Phi }_k^{[\widetilde{J}]} x)}{2} \right) , \end{aligned}$$

and

$$\begin{aligned} \theta =(\pi _1,\ldots , \pi _K,\varvec{\Phi }_1,\ldots , \varvec{\Phi }_K, \varvec{P}_1,\ldots ,\varvec{P}_K) \in \Pi _K \times \left( \mathbb {R}^{q\times p} \right) ^K \times \left( \mathbb {D}_q^{++} \right) ^K. \end{aligned}$$

The collection of models is indexed by \( \mathcal {M}= \mathcal {K} \times \widetilde{\mathcal {J}}\). We denote by \(\mathcal {K} \subset \mathbb {N}^*\) the set of admitted numbers of classes, which we assume to be bounded, without loss of generality. We denote by \(\widetilde{\mathcal {J}}\) a collection of subsets of \(\{1,\ldots ,q\} \times \{1,\ldots ,p\}\), constructed by the Group-Lasso estimator.

To detect the relevant indices and construct the set \(\widetilde{J} \in \widetilde{\mathcal {J}}\), we will use the Group-Lasso estimator defined by (3). For \(K \in \mathcal {K}\), we propose to construct a data-driven grid \(G_K\) of regularization parameters by using the updating formulae of the mixture parameters in the EM algorithm.

Then, for each \(\lambda \in G_K\), we compute the Group-Lasso estimator defined by

$$\begin{aligned} \hat{\theta }^{\text {Group-Lasso}}_\lambda = \underset{\theta \in \Theta _K}{ \mathrm{argmin}} \left\{ -\frac{1}{n} \sum _{i=1}^n \log (h_\theta (\varvec{y}_i|\varvec{x}_i)) + \lambda \sum _{j=1}^{p} \sum _{m=1}^q \sqrt{K}||[\varvec{\Phi }]_{m,j}||_2 \right\} . \end{aligned}$$

For a fixed number of mixture components \(K \in \mathcal {K}\) and a regularization parameter \(\lambda \), we use a generalized EM algorithm to approximate this estimator. Then, for each \(K \in \mathcal {K}\) and for each \(\lambda \in G_K\), we construct the set of relevant indices \(\widetilde{J}_\lambda \). We denote by \(\widetilde{\mathcal {J}}\) the collection of all these sets.

The second step consists in approximating the MLE

$$\begin{aligned} \hat{h}^{(K,\widetilde{J})}= \underset{t \in \mathcal {H}_{(K,\widetilde{J})}}{ \mathrm{argmin}} \left\{ -\frac{1}{n} \sum _{i=1}^n \log (t(\varvec{y}_i|\varvec{x}_i)) \right\} ; \end{aligned}$$

using the EM algorithm for each model \((K,\widetilde{J})\in \mathcal {M}\).

The third step is devoted to model selection, for which we use the slope heuristic described in Birgé and Massart (2007).
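Schematically, the whole procedure can be organized as below. This is only a skeleton: the four callables stand for the routines described above (data-driven grid, generalized EM with the group penalty, EM restricted to a support, slope-heuristic selection) and are passed in as parameters rather than spelled out, so their names are placeholders rather than the paper's implementation.

```python
import numpy as np

def group_lasso_mle(x, y, K_range, build_grid, fit_group_lasso,
                    fit_mle_on_support, select_model):
    """Skeleton of the Group-Lasso-MLE procedure (illustrative).

    build_grid(x, y, K)            -> data-driven grid G_K of lambda values
    fit_group_lasso(x, y, K, lam)  -> (K, p, q) array of Group-Lasso coefficients
    fit_mle_on_support(x, y, K, J) -> MLE of the model restricted to the couples J
    select_model(collection)       -> model chosen by the slope heuristic
    """
    collection = []
    for K in K_range:
        for lam in build_grid(x, y, K):
            Phi = fit_group_lasso(x, y, K, lam)
            # Relevant couples (m, j): nonzero in at least one cluster.
            nonzero = (np.asarray(Phi) != 0).any(axis=0)      # shape (p, q)
            J = [(m, j) for j in range(nonzero.shape[0])
                        for m in range(nonzero.shape[1]) if nonzero[j, m]]
            collection.append((K, tuple(J), fit_mle_on_support(x, y, K, J)))
    return select_model(collection)
```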

1.2.3 Group-Lasso-Rank procedure

We propose a second procedure to take the matrix structure into account. For each model belonging to the collection \(\mathcal {H}_{(K,\widetilde{J})}\), a subcollection is constructed by varying the rank of \(\varvec{\Phi }\). We describe this procedure in more detail.

As in the Group-Lasso-MLE procedure, we first construct a collection of models, thanks to the \(\ell _1\)-approach. We obtain an estimator of \(\theta \), denoted by \(\hat{\theta }^{\text {Group-Lasso}}_\lambda \), for each model belonging to the collection. From it, we deduce the set of relevant indices \(\widetilde{J}\) for each \(K \in \mathcal {K}\), and hence the collection \(\widetilde{\mathcal {J}}\) of sets of relevant indices.

The second step consists in constructing a subcollection of models with rank sparsity, denoted by

$$\begin{aligned} \{\check{\mathcal {H}}_{(K,\widetilde{J},R)}\}_{(K,\widetilde{J},R) \in \widetilde{\mathcal {M}}}. \end{aligned}$$

The model \(\check{\mathcal {H}}_{(K,\widetilde{J},R)}\) has \(K\) components, \(\widetilde{J}\) as its set of relevant indices, and \(R\) as the vector of ranks of the regression matrices in each group:

$$\begin{aligned} \check{\mathcal {H}}_{(K,\widetilde{J},R)}= \left\{ h^{(K,\widetilde{J},R)}_{\theta }(\varvec{y}|\varvec{x}) \right\} \end{aligned}$$
(18)

where

$$\begin{aligned} h^{(K,\widetilde{J},R)}_{\theta }(\varvec{y}|\varvec{x})&= \sum _{k=1}^{K} \frac{\pi _{k} \det (\varvec{P}_k)}{(2 \pi )^{q/2}} \exp \left( -\frac{(\varvec{P}_k\varvec{y}-(\varvec{\Phi }_k^{R_k})^{[\widetilde{J}]} \varvec{x} )^t(\varvec{P}_k\varvec{y}-(\varvec{\Phi }_k^{R_k})^{[\widetilde{J}]} \varvec{x} )}{2} \right) ;\\ \theta&=(\pi _1,\ldots , \pi _K,\varvec{\Phi }_1^{R_1},\ldots , \varvec{\Phi }_K^{R_K}, \varvec{P}_1,\ldots ,\varvec{P}_K) \in \Pi _K \times \Psi _K^R \times \left( \mathbb {R}_+^{q} \right) ^K ;\\ \Psi _K^R&= \left\{ (\varvec{\Phi }_1^{R_1},\ldots , \varvec{\Phi }_K^{R_K}) \in \left( \mathbb {R}^{q\times p} \right) ^K | \text {Rank}(\varvec{\Phi }_1) = R_1,\ldots , \text {Rank}(\varvec{\Phi }_K)=R_K \right\} ; \end{aligned}$$

and \(\widetilde{\mathcal {M}}^R = \mathcal {K} \times \widetilde{\mathcal {J}} \times \mathcal {R}\). Denote by \(\mathcal {K}\subset \mathbb {N}^*\) the possible numbers of components, by \(\widetilde{\mathcal {J}}\) a collection of subsets of \(\{1,\ldots ,q\} \times \{1,\ldots ,p\}\), and by \(\mathcal {R}\) the set of vectors of size \(K \in \mathcal {K}\) giving a rank value for each mean matrix. We compute the MLE under the rank constraint thanks to an EM algorithm: we constrain the estimate of \(\varvec{\Phi }_k\), for all \(k\), to have rank \(R_k\) by keeping only the \(R_k\) largest singular values. More details are given in “EM algorithm for the rank procedure”. This finally leads, for each model, to an estimator of the mean with row sparsity and low rank.
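As an illustration of how the rank sub-collection can be enumerated, the sketch below truncates the refitted regression matrices to every candidate rank vector \(R\). Only the truncation step is shown; in the procedure itself the rank constraint is enforced inside the EM algorithm of “EM algorithm for the rank procedure”, and the names here are hypothetical.

```python
import itertools
import numpy as np

def rank_subcollection(Phi, rank_values):
    """For refitted regression matrices Phi[k] (already restricted to the relevant
    couples), build one candidate per rank vector R = (R_1, ..., R_K) by truncating
    each Phi[k] to rank R_k through its singular value decomposition."""
    K = len(Phi)
    candidates = {}
    for R in itertools.product(rank_values, repeat=K):    # all vectors (R_1, ..., R_K)
        truncated = []
        for k in range(K):
            u, d, vt = np.linalg.svd(Phi[k], full_matrices=False)
            d = d.copy()
            d[R[k]:] = 0.0                                # keep only the R_k largest singular values
            truncated.append(u @ np.diag(d) @ vt)
        candidates[R] = truncated
    return candidates
```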

Cite this article

Devijver, E. Model-based regression clustering for high-dimensional data: application to functional data. Adv Data Anal Classif 11, 243–279 (2017). https://doi.org/10.1007/s11634-016-0242-1
