Data Mining and Knowledge Discovery, Volume 32, Issue 6, pp 1509–1560

Robust finite mixture regression for heterogeneous targets

  • Jian Liang
  • Kun Chen
  • Ming Lin
  • Changshui Zhang
  • Fei Wang


Finite mixture regression (FMR) refers to the mixture-modeling scheme that learns multiple regression models from the training data set, each responsible for a subset of the samples. FMR is an effective scheme for handling sample heterogeneity, where a single regression model is not enough to capture the complexity of the conditional distribution of the observed responses given the features. In this paper, we propose an FMR model that (1) finds sample clusters and jointly models multiple incomplete mixed-type targets simultaneously, (2) achieves shared feature selection among tasks and cluster components, and (3) detects anomaly tasks or clustered structure among tasks while accommodating outlier samples. We provide non-asymptotic oracle performance bounds for our model under a high-dimensional learning framework. The proposed model is evaluated on both synthetic and real-world data sets, and the results show that it achieves state-of-the-art performance.
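To make the basic FMR idea concrete, the following is a minimal sketch of a plain K-component mixture of linear regressions fitted by EM, assuming Gaussian noise. This is an illustration of the generic FMR scheme only, not the paper's method: it omits the proposed handling of mixed-type and incomplete targets, shared feature selection, and outlier accommodation. The function name `fit_fmr` and all hyperparameters are illustrative.

```python
import numpy as np

def fit_fmr(X, y, K=2, n_iter=100, seed=0):
    """Minimal EM for a K-component mixture of linear regressions.

    Illustrative sketch of generic FMR: each component k has its own
    coefficient vector beta[k], noise variance sigma2[k], and mixing
    proportion pi[k]; each sample is softly assigned to components.
    """
    rng = np.random.default_rng(seed)
    n, p = X.shape
    beta = rng.normal(size=(K, p))   # per-component regression coefficients
    sigma2 = np.ones(K)              # per-component noise variances
    pi = np.full(K, 1.0 / K)         # mixing proportions

    for _ in range(n_iter):
        # E-step: posterior responsibility of each component for each sample
        resid = y[:, None] - X @ beta.T                        # (n, K)
        logp = (np.log(pi) - 0.5 * np.log(2 * np.pi * sigma2)
                - 0.5 * resid**2 / sigma2)                     # (n, K)
        logp -= logp.max(axis=1, keepdims=True)                # stabilize
        r = np.exp(logp)
        r /= r.sum(axis=1, keepdims=True)

        # M-step: weighted least squares per component
        for k in range(K):
            w = r[:, k]
            Xw = X * w[:, None]
            beta[k] = np.linalg.solve(Xw.T @ X + 1e-8 * np.eye(p), Xw.T @ y)
            sigma2[k] = (w * (y - X @ beta[k])**2).sum() / w.sum()
        pi = r.mean(axis=0)
    return beta, sigma2, pi, r
```

For example, on data drawn from two regressions with slopes +2 and -2, the two recovered components should approximate those slopes, with the responsibilities `r` recovering the sample clusters.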


Finite mixture regression · Mixed-type response · Incomplete targets · Anomaly detection · Task clustering



The authors would like to thank the editors and reviewers for their valuable suggestions on improving this paper. The work of Jian Liang and Changshui Zhang is funded in part by the National Natural Science Foundation of China under Grant No. 61473167 and the Beijing Natural Science Foundation under Grant No. L172037. Kun Chen's work is partially supported by the U.S. National Science Foundation under Grants DMS-1613295 and IIS-1718798. The work of Fei Wang is supported by the National Science Foundation under Grants IIS-1650723 and IIS-1716432.



Copyright information

© The Author(s) 2018

Authors and Affiliations

  1. Department of Automation, State Key Lab of Intelligent Technologies and Systems, Tsinghua National Laboratory for Information Science and Technology (TNList), Tsinghua University, Beijing, People's Republic of China
  2. Department of Statistics, University of Connecticut, Storrs, USA
  3. Department of Computational Medicine and Bioinformatics, University of Michigan, Ann Arbor, USA
  4. Department of Healthcare Policy and Research, Cornell University, New York City, USA
