Model-based regression clustering for high-dimensional data: application to functional data

Abstract

Finite mixture regression models are useful for modeling the relationship between response and predictors arising from different subpopulations. In this article, we study high-dimensional predictors and a high-dimensional response, and we propose two procedures to cluster observations according to the link between predictors and the response. To reduce the dimension, we propose to use the Lasso estimator, which accounts for sparsity, and a maximum likelihood estimator penalized by the rank, which accounts for the matrix structure. To choose the number of components and the sparsity level, we construct a collection of models varying those two parameters, and we select a model from this collection with a non-asymptotic criterion. We extend these procedures to functional data, where predictors and responses are functions. For this purpose, we use a wavelet-based approach. For each situation, we provide algorithms, and we apply and evaluate our methods on both simulated and real datasets to understand how they work in practice.

References

  • Anderson TW (1951) Estimating linear restrictions on regression coefficients for multivariate normal distributions. Ann Math Stat 22(3):327–351

  • Baudry J-P, Maugis C, Michel B (2012) Slope heuristics: overview and implementation. Stat Comput 22(2):455–470

  • Birgé L, Massart P (2007) Minimal penalties for Gaussian model selection. Probab Theory Relat Fields 138(1–2):33–73

  • Bühlmann P, van de Geer S (2011) Statistics for high-dimensional data: methods, theory and applications. Springer Series in Statistics. Springer, Berlin

  • Bunea F, She Y, Wegkamp M (2012) Joint variable and rank selection for parsimonious estimation of high-dimensional matrices. Ann Stat 40(5):2359–2388

  • Celeux G, Govaert G (1995) Gaussian parsimonious clustering models. Pattern Recognit 28(5):781–793

  • Ciarleglio A, Ogden T (2014) Wavelet-based scalar-on-function finite mixture regression models. Comput Stat Data Anal 93:86–96

  • Dempster AP, Laird NM, Rubin DB (1977) Maximum likelihood from incomplete data via the EM algorithm (with discussion). J R Stat Soc Ser B 39:1–38

  • Devijver E (2015) Finite mixture regression: a sparse variable selection by model selection for clustering. Electron J Stat 9(2):2642–2674

  • Devijver E (2015) Joint rank and variable selection for parsimonious estimation in high-dimension finite mixture regression model. arXiv:1501.00442

  • Ferraty F, Vieu P (2006) Nonparametric functional data analysis: theory and practice. Springer Series in Statistics. Springer, New York

  • James G, Sugar C (2003) Clustering for sparsely sampled functional data. J Am Stat Assoc 98(462):397–408

  • Giraud C (2011) Low rank multivariate regression. Electron J Stat 5:775–799

  • Hubert L, Arabie P (1985) Comparing partitions. J Classif 2(1):193–218

  • Izenman A (1975) Reduced-rank regression for the multivariate linear model. J Multivar Anal 5(2):248–264

  • Jones PN, McLachlan GJ (1992) Fitting finite mixture models in a regression context. Aust J Stat 34:233–240

  • Mallat S (1989) A theory for multiresolution signal decomposition: the wavelet representation. IEEE Trans Pattern Anal Mach Intell 11:674–693

  • Mallat S (1999) A wavelet tour of signal processing. Academic Press, Dublin

  • McLachlan G, Peel D (2004) Finite mixture models. Wiley Series in Probability and Statistics. Wiley, New York

  • Meinshausen N, Bühlmann P (2010) Stability selection. J R Stat Soc Ser B Stat Methodol 72(4):417–473

  • Meynet C, Maugis-Rabusseau C (2012) A sparse variable selection procedure in model-based clustering. Research report

  • Misiti M, Misiti Y, Oppenheim G, Poggi J-M (2004) Matlab Wavelet Toolbox user's guide, version 3. The MathWorks Inc, Natick

  • Misiti M, Misiti Y, Oppenheim G, Poggi J-M (2007) Clustering signals using wavelets. In: Sandoval F, Prieto A, Cabestany J, Graña M (eds) Computational and ambient intelligence. Lecture Notes in Computer Science, vol 4507. Springer, Berlin Heidelberg, pp 514–521

  • Park T, Casella G (2008) The Bayesian lasso. J Am Stat Assoc 103(482):681–686

  • Ramsay JO, Silverman BW (2005) Functional data analysis. Springer Series in Statistics. Springer, New York

  • Simon N, Friedman J, Hastie T, Tibshirani R (2013) A sparse-group lasso. J Comput Graph Stat 22:231–245

  • Städler N, Bühlmann P, van de Geer S (2010) \(\ell _{1}\)-penalization for mixture regression models. Test 19(2):209–256

  • Tibshirani R (1996) Regression shrinkage and selection via the lasso. J R Stat Soc Ser B 58(1):267–288

  • Tseng P (2001) Convergence of a block coordinate descent method for nondifferentiable minimization. J Optim Theory Appl 109:475–494

  • Wu CFJ (1983) On the convergence properties of the EM algorithm. Ann Stat 11(1):95–103

  • Yao F, Fu Y, Lee T (2011) Functional mixture regression. Biostatistics 12(2):341–353

  • Zhao Y, Ogden T, Reiss P (2012) Wavelet-based LASSO in functional linear regression. J Comput Graph Stat 21(3):600–617

Acknowledgments

I am indebted to Jean-Michel Poggi and Pascal Massart for suggesting that I study this problem, and for stimulating discussions. I am also grateful to Jean-Michel Poggi for carefully reading the manuscript and making many useful suggestions. I thank Yves Misiti and Benjamin Auder for their help in speeding up the code. I also thank the referees for their very interesting improvements and suggestions, and the editors for their help in writing this paper. I am also grateful to Irène Gijbels for carefully proofreading the manuscript.

Author information

Corresponding author

Correspondence to Emilie Devijver.

Appendix

In this appendix, we first derive the updating formulae of the EM algorithms in “EM algorithms”. In “Group-Lasso MLE and Group-Lasso Rank procedures”, we extend our procedures by using the Group-Lasso estimator, rather than the Lasso estimator, to select relevant indices.

1.1 EM algorithms

1.1.1 EM algorithm for the Lasso estimator

Introduced by Dempster et al. (1977), the EM (Expectation-Maximization) algorithm is used to compute maximum likelihood estimators, penalized or not.

The expected complete negative log-likelihood is denoted by

$$\begin{aligned} Q(\theta |\theta ')=-\frac{1}{n} E_{\theta '}(l_{c}(\theta ,\varvec{Y},\varvec{x},\varvec{Z})|\varvec{Y}) \end{aligned}$$

in which

$$\begin{aligned} l_{c} (\theta ,\varvec{Y} ,\varvec{x}, \varvec{Z}) =&\sum _{i=1}^{n} \sum _{k=1}^{K} [\varvec{Z}_i]_k \log \left( \frac{\det (\varvec{P}_k)}{(2 \pi )^{q/2}} \exp \left( - \frac{1}{2} (\varvec{P}_k \varvec{Y}_i - \varvec{x}_i\varvec{\Phi }_k)^t (\varvec{P}_k \varvec{Y}_i - \varvec{x}_i\varvec{\Phi }_k) \right) \right) \\&+ [\varvec{Z}_i]_k \log (\pi _{k}) ; \end{aligned}$$

where the \(\varvec{Z}_i\) are unobserved independent and identically distributed random variables following a multinomial distribution with parameters 1 and \((\pi _1,\ldots ,\pi _K)\), describing the component membership of the \(i\)th observation in the finite mixture regression model.

The expected complete penalized negative log-likelihood is

$$\begin{aligned} Q_{\text {pen}}(\theta |\theta ')=Q(\theta |\theta ')+ \lambda \sum _{k=1}^{K} \pi _{k} ||\varvec{\Phi }_k||_1. \end{aligned}$$

Calculus for updating formulae

  • E-step: compute \(Q(\theta |\theta ^{(\text {ite})})\), or, equivalently, compute for \(k \in \{1,\ldots ,K\}\), \(i \in \{1,\ldots ,n\}\),

    $$\begin{aligned} \varvec{\tau }^{(\text {ite})}_{i,k}&=E_{\theta ^{(\text {ite})}}([\varvec{Z}_i]_k|\varvec{Y}_i)\\&=\frac{\pi _{k}^{(\text {ite})} \det \varvec{P}_k^{(\text {ite})} \exp \left( -\frac{1}{2}\left( \varvec{P}_k^{(\text {ite})} \varvec{y}_i-\varvec{x}_i \varvec{\Phi }_k^{(\text {ite})}\right) ^t\left( \varvec{P}_k^{(\text {ite})} \varvec{y}_i-\varvec{x}_i \varvec{\Phi }_k^{(\text {ite})}\right) \right) }{\sum _{r=1}^{K}\pi _{r}^{(\text {ite})}\det \varvec{P}_r^{(\text {ite})} \exp \left( -\frac{1}{2}\left( \varvec{P}_r^{(\text {ite})} \varvec{y}_i-\varvec{x}_i \varvec{\Phi }_r^{(\text {ite})}\right) ^t\left( \varvec{P}_r^{(\text {ite})} \varvec{y}_i-\varvec{x}_i \varvec{\Phi }_r^{(\text {ite})}\right) \right) } \end{aligned}$$

    Thanks to the MAP principle, the clustering is updated.

  • M-step: optimize \(Q_{\text {pen}}(\theta |\theta ^{(\text {ite})})\) with respect to \(\theta \). For this, rewrite the Karush–Kuhn–Tucker conditions. We have

    $$\begin{aligned}&Q_\text {pen}(\theta |\theta ^{(\text {ite})}) \nonumber \\ =&- \frac{1}{n} \sum _{i=1}^{n} \sum _{k=1}^{K} E_{\theta ^{(\text {ite})}} \left( [\varvec{Z}_i]_k \log \left( \frac{\det (\varvec{P}_k)}{(2 \pi )^{q/2}} \exp \left( -\frac{1}{2} (\varvec{P}_k \varvec{y}_i - \varvec{x}_i \varvec{\Phi }_k)^t(\varvec{P}_k \varvec{y}_i - \varvec{x}_i \varvec{\Phi }_k) \right) \right) \left| \varvec{Y} \right. \right) \nonumber \\&- \frac{1}{n} \sum _{i=1}^{n} \sum _{k=1}^{K} E_{\theta ^{(\text {ite})}} \left( [\varvec{Z}_i]_k \log \pi _{k} |\varvec{Y} \right) + \lambda \sum _{k=1}^{K} \pi _{k} ||\varvec{\Phi }_k||_1 \nonumber \\ =&- \frac{1}{n} \sum _{i=1}^{n} \sum _{k=1}^{K} -\frac{1}{2} (\varvec{P}_k \varvec{y}_i - \varvec{x}_i \varvec{\Phi }_k)^t (\varvec{P}_k \varvec{y}_i - \varvec{x}_i \varvec{\Phi }_k)E_{\theta ^{(\text {ite})}} \left( [\varvec{Z}_i]_k |\varvec{Y} \right) \nonumber \\&- \frac{1}{n} \sum _{i=1}^{n} \sum _{k=1}^{K} \sum _{m=1}^q \log \left( \frac{[\varvec{P}_k]_{m,m}}{\sqrt{2 \pi }} \right) E_{\theta ^{(\text {ite})}} \left( [\varvec{Z}_i]_k |\varvec{Y} \right) \nonumber \\&- \frac{1}{n} \sum _{i=1}^{n} \sum _{k=1}^{K} E_{\theta ^{(\text {ite})}} \left( [\varvec{Z}_i]_k |\varvec{Y} \right) \log \pi _{k} + \lambda \sum _{k=1}^{K} \pi _{k} ||\varvec{\Phi }_k||_1. \end{aligned}$$
    (14)

    Firstly, we optimize this formula with respect to \(\pi \): this is equivalent to optimizing

    $$\begin{aligned} -\frac{1}{n} \sum _{i=1}^{n} \sum _{k=1}^{K} \varvec{\tau }_{i,k}^{(\text {ite})} \log (\pi _{k}) + \lambda \sum _{k=1}^{K} \pi _{k} ||\varvec{\Phi }_k||_1. \end{aligned}$$

    We obtain

    $$\begin{aligned} \pi _{k}^{(\text {ite}+1)}&= \pi _{k}^{(\text {ite})}+t^{(\text {ite})} \left( \frac{\sum _{i=1}^{n} \varvec{\tau }_{i,k}^{(\text {ite})}}{n} - \pi _{k}^{(\text {ite})}\right) ; \end{aligned}$$
    (15)

    where \(t^{(\text {ite})} \in (0,1]\) is the largest value in \(\{0.1^{l}, l\in \mathbb {N} \}\) such that the update of \(\pi \) defined in (15) improves the expected complete penalized negative log-likelihood. To optimize (14) with respect to \((\varvec{\Phi },\mathbf {P})\), we rewrite the expression: it amounts to optimizing

    $$\begin{aligned}&-\frac{1}{n} \sum _{i=1}^n \left( \varvec{\tau }_{i,k}^{(\text {ite})} \sum _{m=1}^q \log ([\varvec{P}_k]_{m,m}) - \frac{1}{2} (\varvec{P}_k \varvec{\widetilde{y}}^{(\text {ite})}_i - \varvec{\widetilde{x}}^{(\text {ite})}_i \varvec{\Phi }_k)^t(\varvec{P}_k \varvec{\widetilde{y}}^{(\text {ite})}_i - \varvec{\widetilde{x}}^{(\text {ite})}_i \varvec{\Phi }_k) \right) \\&\quad + \lambda \pi _{k} ||\varvec{\Phi }_k||_1 \end{aligned}$$

    for all \(k \in \{1,\ldots ,K\}\), with for all \(j \in \{1, \ldots , p\}\) for all \(i\in \{1,\ldots ,n\}\), for all \(k \in \{1,\ldots ,K\}\),

    $$\begin{aligned}&([\varvec{\widetilde{y}}^{(\text {ite})}_{i}]_{k,.},[\varvec{\widetilde{x}}^{(\text {ite})}_i]_{k,.}) = \sqrt{\varvec{\tau }_{i,k}^{(\text {ite})}} (\varvec{y}_i,\varvec{x}_i). \end{aligned}$$

    Remark that this is equivalent to optimizing

    $$\begin{aligned}&-\frac{1}{n} n_k^{(\text {ite})} \sum _{m=1}^q \log ( [\varvec{P}_k]_{m,m}) + \frac{1}{2n} \sum _{i=1}^n \sum _{m=1}^q \left( [\varvec{P}_k]_{m,m} [\varvec{\widetilde{y}}^{(\text {ite})}_i]_{k,m} - [\varvec{\Phi }_k]_{m,.} [\varvec{\widetilde{x}}^{(\text {ite})}_i]_{k,.} \right) ^2\nonumber \\&+ \lambda \pi _{k} ||\varvec{\Phi }_k||_1 ; \end{aligned}$$
    (16)

    with \(n_k^{(\text {ite})} = \sum _{i=1}^n \varvec{\tau }^{(\text {ite})}_{i,k}\). With respect to \([\varvec{P}_k]_{m,m}\), the optimization of (16) is the solution of

    $$\begin{aligned} - \frac{n_k^{(\text {ite})}}{n} \frac{1}{[\varvec{P}_k]_{m,m}} + \frac{1}{2n} \sum _{i=1}^n 2 [\varvec{\widetilde{y}}^{(\text {ite})}_i]_{k,m} \left( [\varvec{P}_k]_{m,m} [\varvec{\widetilde{y}}^{(\text {ite})}_i]_{k,m} - [\varvec{\Phi }_k]_{m,.} [\varvec{\widetilde{x}}^{(\text {ite})}_i]_{k,.}\right) =0 \end{aligned}$$

    for all \(k \in \{1,\ldots ,K\}\), for all \(m \in \{1,\ldots ,q\}\), which is equivalent to

    $$\begin{aligned} -1+\frac{1}{n_k^{(\text {ite})}} [\varvec{P}_k]_{m,m}^2 \sum _{i=1}^n [\varvec{\widetilde{y}}^{(\text {ite})}_i]_{k,m}^2 - \frac{1}{n_k^{(\text {ite})}} [\varvec{P}_k]_{m,m} \sum _{i=1}^n [\varvec{\widetilde{y}}^{(\text {ite})}_{i}]_{k,m} [\varvec{\Phi }_k]_{m,.} [\varvec{\widetilde{x}}^{(\text {ite})}_i]_{k,.}&=0\\ \Leftrightarrow -1 + [\varvec{P}_k]_{m,m}^2 \frac{1}{n_k^{(\text {ite})}} || [{\varvec{\widetilde{y}}^{(\text {ite})}}]_{k,m} ||_2^2 - [\varvec{P}_k]_{m,m}\frac{1}{n_k^{(\text {ite})}} \langle [\widetilde{\varvec{y}}^{(\text {ite})}]_{k,m}, [\varvec{\Phi }_k]_{m,.} [\widetilde{\varvec{x}}^{(\text {ite})}]_{k,.} \rangle&=0. \end{aligned}$$

    The discriminant is

    $$\begin{aligned} \Delta _{k,m} = \left( -\frac{1}{n_k^{(\text {ite})}} \langle [\widetilde{\varvec{y}}^{(\text {ite})}]_{k,m},[\varvec{\Phi }_k]_{m,.} [\widetilde{\varvec{x}}^{(\text {ite})}]_{k,.} \rangle \right) ^2 + \frac{4}{n_k^{(\text {ite})}} ||[{\varvec{\widetilde{y}}^{(\text {ite})}}]_{k,m}||_2^2. \end{aligned}$$

    Then, for all \(k \in \{1,\ldots ,K\}\), for all \(m \in \{1,\ldots ,q\}\),

    $$\begin{aligned} \, [\varvec{P}_k]_{m,m}= \frac{\langle [\widetilde{\varvec{y}}^{(\text {ite})}]_{k,m}, [\varvec{\Phi }_k]_{m,.} [\widetilde{\varvec{x}}^{(\text {ite})}]_{k,.} \rangle + n_k^{(\text {ite})}\sqrt{\Delta _{k,m}}}{2 ||[{\varvec{\widetilde{y}}^{(\text {ite})}}]_{k,m}||_2^2}. \end{aligned}$$

    We also look at Eq. (14) as a function of the variable \(\varvec{\Phi }\): setting the partial derivative with respect to \([\varvec{\Phi }_k]_{m,j}\) to zero, we obtain for all \(m \in \{1,\ldots ,q\}\), for all \(k \in \{1,\ldots ,K\}\), for all \(j \in \{1,\ldots ,p\}\),

    $$\begin{aligned}&\sum _{i=1}^n [\varvec{\widetilde{x}}^{(\text {ite})}_i]_{k,j} \left( [\varvec{P}_k]_{m,m} [\varvec{\widetilde{y}}^{(\text {ite})}_i]_{k,m} - \sum _{j_2 =1}^{p} [\varvec{\widetilde{x}}^{(\text {ite})}_i]_{k,j_2} [\varvec{\Phi }_k]_{m,j_2}\right) \\&\quad -n\lambda \pi _{k} \text {sgn}([\varvec{\Phi }_k]_{m,j})=0, \end{aligned}$$

    where \(\text {sgn}\) is the sign function. Then, for all \(k \in \{1,\ldots ,K\}, j \in \{1,\ldots ,p\}, m \in \{1,\ldots ,q\}\),

    $$\begin{aligned} \, [\varvec{\Phi }_k]_{m,j} = \frac{\sum _{i=1}^{n} [\varvec{\widetilde{x}}^{(\text {ite})}_i]_{k,j} \left( [\varvec{P}_k]_{m,m} [\varvec{\widetilde{y}}^{(\text {ite})}_i]_{k,m} - \sum _{\genfrac{}{}{0.0pt}{}{j_2=1}{j_2\ne j}}^{p} [\varvec{\widetilde{x}}^{(\text {ite})}_i]_{k,j_2} [\varvec{\Phi }_k]_{m,j_2}\right) - n\lambda \pi _{k} \text {sgn}([\varvec{\Phi }_k]_{m,j})}{||[{\varvec{\widetilde{x}}^{(\text {ite})}}]_{k,j}||_{2}^2}. \end{aligned}$$

    Let, for all \(k \in \{1,\ldots ,K\}, j \in \{1,\ldots ,p\}, m \in \{1,\ldots ,q\}\),

    $$\begin{aligned} \, [\varvec{S}_{k}]_{j,m}=-\sum _{i=1}^{n} [\varvec{\widetilde{x}}^{(\text {ite})}_i]_{k,j} \left( [\varvec{P}_k]_{m,m} [\varvec{\widetilde{y}}^{(\text {ite})}_i]_{k,m} - \sum _{\genfrac{}{}{0.0pt}{}{j_2=1 }{ j_2\ne j}}^{p} [\varvec{\widetilde{x}}^{(\text {ite})}_i]_{k,j_2} [\varvec{\Phi }_k]_{m,j_2}\right) . \end{aligned}$$

    Then

    $$\begin{aligned}{}[\varvec{\Phi }_k]_{m,j}&= \frac{-[\varvec{S}_{k}]_{j,m}- n\lambda \pi _{k} \text {sgn}([\varvec{\Phi }_k]_{m,j})}{||[{\varvec{\widetilde{x}}^{(\text {ite})}}]_{k,j}||_{2}^2} = \left\{ \begin{array}{lll} &{}\frac{-[\varvec{S}_{k}]_{j,m}+ n\lambda \pi _{k} }{||[{\varvec{\widetilde{x}}^{(\text {ite})}}]_{k,j}||_{2}^2} &{} \quad \text {if } [\varvec{S}_{k}]_{j,m}>n \lambda \pi _{k}\\ &{}-\frac{[\varvec{S}_{k}]_{j,m}+ n\lambda \pi _{k}}{||[{\varvec{\widetilde{x}}^{(\text {ite})}}]_{k,j}||_{2}^2} &{} \quad \text {if } [\varvec{S}_{k}]_{j,m} < -n \lambda \pi _{k}\\ &{}0 &{}\quad \text {elsewhere.} \end{array} \right. \end{aligned}$$

From these equalities, we write the updating formulae; a schematic implementation of one EM iteration is sketched below.
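To make the updating formulae concrete, here is a minimal Python/numpy sketch of one EM iteration. It is an illustration under assumed conventions, not the implementation used in the paper: the name `em_step_lasso`, the array shapes, and the simplified proportion update (which drops the line search of Eq. (15)) are all choices made for this sketch.

```python
import numpy as np

def em_step_lasso(x, y, pi, Phi, P, lam):
    """One E-step + M-step of the penalized EM algorithm (illustrative sketch).

    Shape conventions chosen here: x is (n, p), y is (n, q), pi is (K,),
    Phi is (K, p, q) so that x @ Phi[k] has shape (n, q),
    P is (K, q) holding the diagonal of each P_k; lam is the Lasso parameter.
    """
    n, p = x.shape
    q = y.shape[1]
    K = len(pi)

    # E-step: responsibilities tau_{i,k}, computed on the log scale for stability.
    log_num = np.empty((n, K))
    for k in range(K):
        resid = y * P[k] - x @ Phi[k]                     # P_k y_i - x_i Phi_k
        log_num[:, k] = (np.log(pi[k]) + np.sum(np.log(P[k]))
                         - 0.5 * np.sum(resid ** 2, axis=1))
    log_num -= log_num.max(axis=1, keepdims=True)
    tau = np.exp(log_num)
    tau /= tau.sum(axis=1, keepdims=True)

    # M-step: proportions (plain responsibility average, line search omitted).
    pi_new = tau.mean(axis=0)

    Phi_new, P_new = Phi.copy(), P.copy()
    for k in range(K):
        w = np.sqrt(tau[:, k])[:, None]
        xt, yt = w * x, w * y                             # weighted data \tilde x, \tilde y
        nk = tau[:, k].sum()
        for m in range(q):
            # Update [P_k]_{m,m}: positive root of the quadratic equation above.
            a = np.sum(yt[:, m] ** 2) / nk
            b = np.dot(yt[:, m], xt @ Phi_new[k][:, m]) / nk
            P_new[k, m] = (b + np.sqrt(b ** 2 + 4 * a)) / (2 * a)
            # Coordinate-wise soft-thresholding update of [Phi_k]_{m,j}.
            for j in range(p):
                r = (P_new[k, m] * yt[:, m] - xt @ Phi_new[k][:, m]
                     + xt[:, j] * Phi_new[k][j, m])       # partial residual
                s = np.dot(xt[:, j], r)
                thr = n * lam * pi_new[k]
                Phi_new[k][j, m] = (np.sign(s) * max(abs(s) - thr, 0.0)
                                    / np.sum(xt[:, j] ** 2))
    return pi_new, Phi_new, P_new, tau
```

The order in which \(\varvec{P}_k\) and \(\varvec{\Phi }_k\) are updated inside the loop is one possible choice among several for a coordinate-wise M-step.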

1.1.2 EM algorithm for the rank procedure

To take the matrix structure into account, we perform a dimension reduction on the rank of the regression matrix. If the clustering were known, we could compute the low-rank estimator of a linear model within each cluster.

Indeed, an estimator of fixed rank \(r\) is known in the linear regression case. Denote by \(A^+\) the Moore-Penrose pseudo-inverse of \(A\) and by \([A]_r = U D_r V^t\) the rank-\(r\) truncation of \(A\), where \(U D V^t\) is the singular value decomposition of \(A\) and \(D_r\) is obtained from \(D\) by setting \((D_{r})_{i,i}=0\) for \(i \ge r+1\). If \(\varvec{Y}=\mathbf {B} \varvec{x} + \varvec{\Sigma }\), an estimator of \(\mathbf {B}\) with rank \(r\) is \(\hat{\mathbf {B}}_r = [(\varvec{x}^t \varvec{x})^{+} \varvec{x}^t \varvec{y}]_r\).
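For illustration, this fixed-rank estimator can be computed directly with numpy. This is a sketch; the function name and the convention that observations are stacked in the rows of \(\varvec{x}\) and \(\varvec{y}\) are choices made here.

```python
import numpy as np

def reduced_rank_estimator(x, y, r):
    """Rank-r estimator [ (x^t x)^+ x^t y ]_r of B in the model y = x B + noise,
    with x of shape (n, p) and y of shape (n, q)."""
    b_ols = np.linalg.pinv(x.T @ x) @ x.T @ y          # (x^t x)^+ x^t y
    u, d, vt = np.linalg.svd(b_ols, full_matrices=False)
    d[r:] = 0.0                                        # zero out singular values beyond rank r
    return u @ np.diag(d) @ vt
```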

However, the clustering of the observations is unknown. We therefore use an EM algorithm, whose E-step computes the a posteriori probability of each observation belonging to each cluster.

We suppose in this case that \(\varvec{\Sigma }_k\) and \(\pi _k\) are known, for all \(k \in \{1,\ldots ,K\}\). We use this algorithm to determine \(\varvec{\Phi }_k\), for all \(k \in \{1,\ldots ,K\}\), with ranks fixed to \(\varvec{R} = (R_1, \ldots , R_K)\).

  • E-step: compute for \(k \in \{1,\ldots ,K\}\), \(i \in \{1,\ldots ,n\}\),

    $$\begin{aligned} \varvec{\tau }_{i,k}^{(\text {ite})}&=E_{\theta ^{(\text {ite})}}([\varvec{Z}_i]_k|\varvec{Y}_i)\\&=\frac{\pi _{k}^{(\text {ite})} \det \varvec{P}_k^{(\text {ite})} \exp \left( -\frac{1}{2}\left( \varvec{P}_k^{(\text {ite})} \varvec{y}_i-\varvec{x}_i \varvec{\Phi }_k^{(\text {ite})}\right) ^t\left( \varvec{P}_k^{(\text {ite})} \varvec{y}_i-\varvec{x}_i \varvec{\Phi }_k^{(\text {ite})}\right) \right) }{\sum _{r=1}^{K}\pi _{r}^{(\text {ite})}\det \varvec{P}_r^{(\text {ite})} \exp \left( -\frac{1}{2}\left( \varvec{P}_r^{(\text {ite})} \varvec{y}_i-\varvec{x}_i \varvec{\Phi }_r^{(\text {ite})}\right) ^t \left( \varvec{P}_r^{(\text {ite})} \varvec{y}_i-\varvec{x}_i \varvec{\Phi }_r^{(\text {ite})}\right) \right) }. \end{aligned}$$
  • M-step: assign each observation to its estimated cluster by the MAP principle, using the probabilities computed in the E-step: observation \(i\) is assigned to the component \({{ \mathrm{argmax}}_{k \in \{ 1,\ldots , K\}}}\varvec{\tau }^{(\text {ite})}_{i,k}\). Then, for all \(k \in \{1,\ldots ,K\}\), we define \(\widetilde{\mathbf {B}}_k^{(\text {ite})} = (\varvec{x}^t_{|k} \varvec{x}_{|k})^{-1} \varvec{x}^t_{|k} \varvec{y}_{|k}\), in which \(\varvec{x}_{|k}\) and \(\varvec{y}_{|k}\) correspond to the observations assigned to cluster \(k\). For all \(k \in \{1,\ldots ,K\}\), we compute the singular value decomposition \(\widetilde{\mathbf {B}}_{k}^{(\text {ite})} = U S V^t\) and estimate the regression matrix in cluster \(k\) by \(\hat{\mathbf {B}}_{k}^{(\text {ite})} = U [\varvec{S}]_{R_k} V^t\). A schematic implementation of this step is sketched below.
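The following is a minimal sketch of this rank-constrained M-step, assuming the responsibilities \(\varvec{\tau }^{(\text {ite})}\) have already been computed in the E-step. The names and the use of a pseudo-inverse (as in the fixed-rank estimator above) are choices made for this illustration.

```python
import numpy as np

def m_step_rank(x, y, tau, ranks):
    """Rank-constrained M-step: MAP assignment, then a rank-R_k estimator
    of the regression matrix within each cluster.

    x : (n, p) predictors, y : (n, q) responses,
    tau : (n, K) responsibilities from the E-step, ranks : (R_1, ..., R_K).
    """
    K = tau.shape[1]
    labels = tau.argmax(axis=1)                        # MAP principle
    B_hat = []
    for k in range(K):
        xk, yk = x[labels == k], y[labels == k]
        b_ols = np.linalg.pinv(xk.T @ xk) @ xk.T @ yk  # least squares in cluster k
        u, d, vt = np.linalg.svd(b_ols, full_matrices=False)
        d[ranks[k]:] = 0.0                             # keep the R_k largest singular values
        B_hat.append(u @ np.diag(d) @ vt)
    return B_hat, labels
```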

1.2 Group-Lasso MLE and Group-Lasso Rank procedures

One way to perform these procedures is to use the Group-Lasso estimator rather than the Lasso estimator to select relevant indices. Indeed, this estimator is more natural with respect to the definition of relevant indices. Nevertheless, the results are very similar, because indices are selected in groups whether we use the Lasso estimator or the Group-Lasso estimator. In this section, we describe our procedures with the Group-Lasso estimator.

1.2.1 Context-definitions

Both our procedures take advantage of the Lasso estimator to select relevant indices and to reduce the dimension in case of high-dimensional data. First, recall what is meant by relevant indices.

Definition 2

A couple \((\varvec{y}_m,\varvec{x}_j)\) is said to be irrelevant for the clustering if \([\varvec{\Phi }_{1}]_{m,j} = \ldots = [\varvec{\Phi }_{K}]_{m,j}=0\), which means that the variable \(\varvec{x}_j\) does not explain the variable \(\varvec{y}_m\) for the clustering process. We also say that the indices \((m,j)\) are irrelevant if the couple \((\varvec{y}_m,\varvec{x}_j)\) is irrelevant. A relevant couple is a couple which is not irrelevant: in at least one cluster \(k\), the coefficient \([\varvec{\Phi }_k]_{m,j}\) is not equal to zero. We denote by \(J\) the set of indices \((m,j)\) of relevant couples \((\varvec{y}_m,\varvec{x}_j)\).

According to this definition, we introduce the Group-Lasso estimator.

Definition 3

The Group-Lasso estimator for mixture regression models with regularization parameter \(\lambda \ge 0\) is defined by

$$\begin{aligned} \hat{\theta }^{\text {Group-Lasso}}_\lambda := \underset{\theta \in \Theta _K}{\mathrm{argmin}} \left\{ -\frac{1}{n} \widetilde{l}_{\lambda }(\theta ) \right\} ; \end{aligned}$$

where

$$\begin{aligned} - \frac{1}{n} \widetilde{l}_\lambda (\theta ) = -\frac{1}{n} l(\theta ) + \lambda \sum _{j=1}^{p} \sum _{m=1}^q \sqrt{K}||[\varvec{\Phi }]_{m,j}||_2 ; \end{aligned}$$

where \(||[\varvec{\Phi }]_{m,j}||_2^2 = \sum _{k=1}^K |[\varvec{\Phi }_k]_{m,j}|^2\) and \(\lambda \) is a regularization parameter to be specified.

This Group-Lasso estimator has the advantage of shrinking grouped indices to zero, rather than indices one by one. It is consistent with the definition of relevant indices.
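For intuition, the update induced by this penalty acts on each block \(([\varvec{\Phi }_1]_{m,j},\ldots ,[\varvec{\Phi }_K]_{m,j})\) through a group soft-thresholding operator, sketched below. This is an illustration, not code from the paper; the function name and the way the threshold is parametrized are assumptions.

```python
import numpy as np

def group_soft_threshold(phi_mj, threshold):
    """Shrink the block ([Phi_1]_{m,j}, ..., [Phi_K]_{m,j}) as a whole:
    the couple (m, j) is either kept in every cluster or set to zero in all of them.

    phi_mj : (K,) coefficients of one couple (m, j) across clusters,
    threshold : effective penalty level (e.g. proportional to lambda * sqrt(K)).
    """
    norm = np.linalg.norm(phi_mj)
    if norm <= threshold:
        return np.zeros_like(phi_mj)                  # (m, j) declared irrelevant
    return (1.0 - threshold / norm) * phi_mj          # proportional shrinkage of the block
```

Applied to every couple \((m,j)\), this either keeps or discards the pair \((\varvec{y}_m,\varvec{x}_j)\) simultaneously in all clusters, in line with Definition 2.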

However, depending on the data, it may be of interest to first identify which individual indices are equal to zero. One way is to extend this work with a sparse-group Lasso estimator, described, for example, for the linear model in Simon et al. (2013).

Let us describe two additional procedures, which use the Group-Lasso estimator rather than the Lasso estimator to detect relevant indices.

1.2.2 Group-Lasso-MLE procedure

This procedure is decomposed into three main steps: we construct a collection of models, then in each model we compute the MLE, and finally we select the best model among the collection.

The first step consists in constructing a collection of models \(\{\mathcal {H}_{(K,\widetilde{J})}\}_{(K,\widetilde{J}) \in \mathcal {M}}\) in which \(\mathcal {H}_{(K,\widetilde{J})}\) is defined by

$$\begin{aligned} \mathcal {H}_{(K,\widetilde{J})} = \left\{ h_{\theta }(.|x) \right\} ; \end{aligned}$$
(17)

where

$$\begin{aligned} h_{\theta }(y|x)= \sum _{k=1}^{K} \frac{\pi _{k} \det (\varvec{P}_k)}{(2 \pi )^{q/2}} \exp \left( -\frac{(\varvec{P}_ky-\varvec{\Phi }_k^{[\widetilde{J}]} x )^t(\varvec{P}_ky-\varvec{\Phi }_k^{[\widetilde{J}]} x)}{2} \right) , \end{aligned}$$

and

$$\begin{aligned} \theta =(\pi _1,\ldots , \pi _K,\varvec{\Phi }_1,\ldots , \varvec{\Phi }_K, \varvec{P}_1,\ldots ,\varvec{P}_K) \in \Pi _K \times \left( \mathbb {R}^{q\times p} \right) ^K \times \left( \mathbb {D}_q^{++} \right) ^K. \end{aligned}$$

The collection of models is indexed by \( \mathcal {M}= \mathcal {K} \times \widetilde{\mathcal {J}}\). We denote by \(\mathcal {K} \subset \mathbb {N}^*\) the set of admitted numbers of classes, which we assume to be bounded, without loss of generality. We denote by \(\widetilde{\mathcal {J}}\) a collection of subsets of \(\{1,\ldots ,q\} \times \{1,\ldots ,p\}\), constructed by the Group-Lasso estimator.

To detect the relevant indices and construct the set \(\widetilde{J} \in \widetilde{\mathcal {J}}\), we will use the Group-Lasso estimator defined by (3). For \(K \in \mathcal {K}\), we propose to construct a data-driven grid \(G_K\) of regularization parameters by using the updating formulae of the mixture parameters in the EM algorithm.

Then, for each \(\lambda \in G_K\), we compute the Group-Lasso estimator defined by

$$\begin{aligned} \hat{\theta }^{\text {Group-Lasso}}_\lambda = \underset{\theta \in \Theta _K}{ \mathrm{argmin}} \left\{ -\frac{1}{n} \sum _{i=1}^n \log (h_\theta (\varvec{y}_i|\varvec{x}_i)) + \lambda \sum _{j=1}^{p} \sum _{m=1}^q \sqrt{K}||[\varvec{\Phi }]_{m,j}||_2 \right\} . \end{aligned}$$

For a fixed number of mixture components \(K \in \mathcal {K}\) and a regularization parameter \(\lambda \), we use a generalized EM algorithm to approximate this estimator. Then, for each \(K \in \mathcal {K}\) and for each \(\lambda \in G_K\), we construct the set of relevant indices \(\widetilde{J}_\lambda \). We denote by \(\widetilde{\mathcal {J}}\) the collection of all these sets.

The second step consists in approximating the MLE

$$\begin{aligned} \hat{h}^{(K,\widetilde{J})}= \underset{t \in \mathcal {H}_{(K,\widetilde{J})}}{ \mathrm{argmin}} \left\{ -\frac{1}{n} \sum _{i=1}^n \log (t(\varvec{y}_i|\varvec{x}_i)) \right\} ; \end{aligned}$$

using the EM algorithm for each model \((K,\widetilde{J})\in \mathcal {M}\).

The third step is devoted to model selection, for which we use the slope heuristic described in Birgé and Massart (2007).
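Schematically, the whole procedure can be organized as below. This is only a skeleton: the four callables stand for the routines described above (data-driven grid, generalized EM with the group penalty, EM restricted to a support, slope-heuristic selection) and are passed in as parameters rather than spelled out, so their names are placeholders rather than the paper's implementation.

```python
import numpy as np

def group_lasso_mle(x, y, K_range, build_grid, fit_group_lasso,
                    fit_mle_on_support, select_model):
    """Skeleton of the Group-Lasso-MLE procedure (illustrative).

    build_grid(x, y, K)            -> data-driven grid G_K of lambda values
    fit_group_lasso(x, y, K, lam)  -> (K, p, q) array of Group-Lasso coefficients
    fit_mle_on_support(x, y, K, J) -> MLE of the model restricted to the couples J
    select_model(collection)       -> model chosen by the slope heuristic
    """
    collection = []
    for K in K_range:
        for lam in build_grid(x, y, K):
            Phi = fit_group_lasso(x, y, K, lam)
            # Relevant couples (m, j): nonzero in at least one cluster.
            nonzero = (np.asarray(Phi) != 0).any(axis=0)      # shape (p, q)
            J = [(m, j) for j in range(nonzero.shape[0])
                        for m in range(nonzero.shape[1]) if nonzero[j, m]]
            collection.append((K, tuple(J), fit_mle_on_support(x, y, K, J)))
    return select_model(collection)
```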

1.2.3 Group-Lasso-Rank procedure

We propose a second procedure to take the matrix structure into account. For each model belonging to the collection \(\mathcal {H}_{(K,\widetilde{J})}\), a subcollection is constructed by varying the rank of \(\varvec{\Phi }\). We describe this procedure in more detail.

As in the Group-Lasso-MLE procedure, we first construct a collection of models, thanks to the \(\ell _1\)-approach. We obtain an estimator of \(\theta \), denoted by \(\hat{\theta }^{\text {Group-Lasso}}_\lambda \), for each model belonging to the collection. From it, we deduce the set of relevant indices \(\widetilde{J}\) for each \(K \in \mathcal {K}\), and hence the collection \(\widetilde{\mathcal {J}}\) of sets of relevant indices.

The second step consists in constructing a subcollection of models with rank sparsity, denoted by

$$\begin{aligned} \{\check{\mathcal {H}}_{(K,\widetilde{J},R)}\}_{(K,\widetilde{J},R) \in \widetilde{\mathcal {M}}}. \end{aligned}$$

The model \(\check{\mathcal {H}}_{(K,\widetilde{J},R)}\) has \(K\) components, \(\widetilde{J}\) as its set of relevant indices, and \(R\) as the vector of ranks of the regression matrices in each group:

$$\begin{aligned} \check{\mathcal {H}}_{(K,\widetilde{J},R)}= \left\{ h^{(K,\widetilde{J},R)}_{\theta }(\varvec{y}|\varvec{x}) \right\} \end{aligned}$$
(18)

where

$$\begin{aligned} h^{(K,\widetilde{J},R)}_{\theta }(\varvec{y}|\varvec{x})&= \sum _{k=1}^{K} \frac{\pi _{k} \det (\varvec{P}_k)}{(2 \pi )^{q/2}} \exp \left( -\frac{(\varvec{P}_k\varvec{y}-(\varvec{\Phi }_k^{R_k})^{[\widetilde{J}]} \varvec{x} )^t(\varvec{P}_k\varvec{y}-(\varvec{\Phi }_k^{R_k})^{[\widetilde{J}]} \varvec{x} )}{2} \right) ;\\ \theta&=(\pi _1,\ldots , \pi _K,\varvec{\Phi }_1^{R_1},\ldots , \varvec{\Phi }_K^{R_K}, \varvec{P}_1,\ldots ,\varvec{P}_K) \in \Pi _K \times \Psi _K^R \times \left( \mathbb {R}_+^{q} \right) ^K ;\\ \Psi _K^R&= \left\{ (\varvec{\Phi }_1^{R_1},\ldots , \varvec{\Phi }_K^{R_K}) \in \left( \mathbb {R}^{q\times p} \right) ^K | \text {Rank}(\varvec{\Phi }_1) = R_1,\ldots , \text {Rank}(\varvec{\Phi }_K)=R_K \right\} ; \end{aligned}$$

and \(\widetilde{\mathcal {M}}^R = \mathcal {K} \times \widetilde{\mathcal {J}} \times \mathcal {R}\). Denote by \(\mathcal {K}\subset \mathbb {N}^*\) the possible numbers of components, by \(\widetilde{\mathcal {J}}\) a collection of subsets of \(\{1,\ldots ,q\} \times \{1,\ldots ,p\}\), and by \(\mathcal {R}\) the set of vectors of size \(K \in \mathcal {K}\) giving a rank value for each mean matrix. We compute the MLE under the rank constraint thanks to an EM algorithm: we constrain the estimate of \(\varvec{\Phi }_k\), for all \(k\), to have rank \(R_k\) by keeping only the \(R_k\) largest singular values. More details are given in “EM algorithm for the rank procedure”. This finally leads, for each model, to an estimator of the mean with row sparsity and low rank.
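As an illustration of how the rank sub-collection can be enumerated, the sketch below truncates the refitted regression matrices to every candidate rank vector \(R\). Only the truncation step is shown; in the procedure itself the rank constraint is enforced inside the EM algorithm of “EM algorithm for the rank procedure”, and the names here are hypothetical.

```python
import itertools
import numpy as np

def rank_subcollection(Phi, rank_values):
    """For refitted regression matrices Phi[k] (already restricted to the relevant
    couples), build one candidate per rank vector R = (R_1, ..., R_K) by truncating
    each Phi[k] to rank R_k through its singular value decomposition."""
    K = len(Phi)
    candidates = {}
    for R in itertools.product(rank_values, repeat=K):    # all vectors (R_1, ..., R_K)
        truncated = []
        for k in range(K):
            u, d, vt = np.linalg.svd(Phi[k], full_matrices=False)
            d = d.copy()
            d[R[k]:] = 0.0                                # keep only the R_k largest singular values
            truncated.append(u @ np.diag(d) @ vt)
        candidates[R] = truncated
    return candidates
```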

Cite this article

Devijver, E. Model-based regression clustering for high-dimensional data: application to functional data. Adv Data Anal Classif 11, 243–279 (2017). https://doi.org/10.1007/s11634-016-0242-1
