Abstract
We consider a finite mixture of regressions (FMR) model for high-dimensional inhomogeneous data, where the number of covariates may be much larger than the sample size. We propose an ℓ1-penalized maximum likelihood estimator in an appropriate parameterization. This kind of estimation belongs to a class of problems that require optimization and theory for non-convex functions. It therefore differs clearly from high-dimensional estimation with convex loss or objective functions, such as the Lasso in linear or generalized linear models. Mixture models are a prime example of where non-convexity arises.
For FMR models, we develop an efficient EM algorithm for numerical optimization with provable convergence properties. Our penalized estimator is numerically better posed (e.g., boundedness of the criterion function) than unpenalized maximum likelihood estimation, and it allows for effective statistical regularization including variable selection. We also present some asymptotic theory and oracle inequalities: due to non-convexity of the negative log-likelihood function, different mathematical arguments are needed than for problems with convex losses. Finally, we apply the new method to both simulated and real data.
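The abstract does not spell out the parameterization or the steps of the algorithm, so the following is only a minimal sketch of the general idea: an EM-type iteration for a Gaussian mixture of regressions in which each component's coefficients are updated by ℓ1-penalized weighted least squares. It assumes the standard (β, σ) parameterization rather than the paper's own, uses plain coordinate descent for the M-step, and all function names (`soft_threshold`, `weighted_lasso_cd`, `fmr_l1_em`) are hypothetical. It should not be read as the authors' implementation.

```python
import numpy as np


def soft_threshold(z, t):
    """Soft-thresholding: closed-form solution of a one-dimensional Lasso problem."""
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)


def weighted_lasso_cd(X, y, w, lam, beta, n_sweeps=20):
    """Coordinate descent for 0.5 * sum_i w_i (y_i - x_i' beta)^2 + lam * ||beta||_1."""
    beta = beta.copy()
    r = y - X @ beta                       # residual for the current beta
    for _ in range(n_sweeps):
        for j in range(X.shape[1]):
            xj = X[:, j]
            denom = np.sum(w * xj * xj)
            if denom == 0.0:
                continue
            r += xj * beta[j]              # take coordinate j out of the fit
            zj = np.sum(w * xj * r)
            beta[j] = soft_threshold(zj, lam) / denom
            r -= xj * beta[j]              # put the updated coordinate back
    return beta


def fmr_l1_em(X, y, k=2, lam=0.1, n_iter=50, seed=0):
    """EM-type sketch for an l1-penalized mixture of k Gaussian regressions.

    Assumption (not from the source): the penalty is n * lam * sum_c ||beta_c||_1
    in the standard (beta, sigma) parameterization.
    """
    rng = np.random.default_rng(seed)
    n, p = X.shape
    resp = rng.dirichlet(np.ones(k), size=n)   # random initial responsibilities
    beta = np.zeros((k, p))
    sigma = np.full(k, np.std(y) + 1e-8)
    pi = np.full(k, 1.0 / k)
    for _ in range(n_iter):
        # M-step: update each block (beta_c, then sigma_c, then pi) in turn.
        for c in range(k):
            w = resp[:, c]
            # Multiplying the beta-objective by sigma_c^2 gives a weighted Lasso
            # with threshold n * lam * sigma_c^2.
            beta[c] = weighted_lasso_cd(X, y, w, n * lam * sigma[c] ** 2, beta[c])
            res = y - X @ beta[c]
            var = np.sum(w * res ** 2) / max(np.sum(w), 1e-8)
            sigma[c] = max(np.sqrt(var), 1e-3)  # floor keeps the variance away from 0
        pi = resp.mean(axis=0)
        # E-step: posterior component probabilities, computed in log space.
        log_dens = np.empty((n, k))
        for c in range(k):
            z = (y - X @ beta[c]) / sigma[c]
            log_dens[:, c] = np.log(max(pi[c], 1e-12)) - np.log(sigma[c]) - 0.5 * z ** 2
        log_dens -= log_dens.max(axis=1, keepdims=True)
        resp = np.exp(log_dens)
        resp /= resp.sum(axis=1, keepdims=True)
    return pi, beta, sigma, resp
```

Calling `fmr_l1_em(X, y, k=2, lam=0.05)` on an `(n, p)` design matrix `X` and response vector `y` returns the mixing proportions, the per-component coefficient matrix, the noise scales, and the final responsibilities. Larger values of `lam` set more coefficients exactly to zero, which is the variable-selection effect the abstract refers to. Because the objective is non-convex, the result depends on the random initialization controlled by `seed`.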
Additional information
This invited paper is discussed in the comments available at: doi:10.1007/s11749-010-0198-y, doi:10.1007/s11749-010-0199-x, doi:10.1007/s11749-010-0200-8, doi:10.1007/s11749-010-0201-7, doi:10.1007/s11749-010-0202-6.
About this article
Cite this article
Städler, N., Bühlmann, P. & van de Geer, S. ℓ1-penalization for mixture regression models. TEST 19, 209–256 (2010). https://doi.org/10.1007/s11749-010-0197-z
DOI: https://doi.org/10.1007/s11749-010-0197-z
Keywords
- Adaptive Lasso
- Finite mixture models
- Generalized EM algorithm
- High-dimensional estimation
- Lasso
- Oracle inequality