
Robust Clustering in Regression Analysis via the Contaminated Gaussian Cluster-Weighted Model

Abstract

The Gaussian cluster-weighted model (CWM) is a mixture of regression models with random covariates that allows for flexible clustering of a random vector composed of response variables and covariates. In each mixture component, a Gaussian distribution is adopted for both the covariates and the responses given the covariates. To make the approach robust to the presence of mildly atypical observations, the contaminated Gaussian CWM is introduced. In addition to the parameters of the Gaussian CWM, each mixture component has a parameter controlling the proportion of outliers, one controlling the proportion of leverage points, one specifying the degree of contamination with respect to the response variables, and another specifying the degree of contamination with respect to the covariates. Crucially, these parameters do not have to be specified a priori, adding flexibility to the approach. Furthermore, once the model is estimated and the observations are assigned to the components, a finer intra-group classification into typical points, (mild) outliers, good leverage points, and bad leverage points—concepts of primary importance in robust regression analysis—can be obtained directly. Relations with other mixture-based contaminated models are analyzed, identifiability conditions are provided, an expectation-conditional maximization algorithm is outlined for parameter estimation, and various implementation and operational issues are discussed. Properties of the estimators of the regression coefficients are evaluated through Monte Carlo experiments and compared with other procedures. A sensitivity study is also conducted based on a real data set.
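The robustness mechanism described above rests on the contaminated Gaussian distribution: within each component, a two-part mixture of a "typical" Gaussian and a second Gaussian with the same center but inflated covariance, so atypical points can be detected a posteriori rather than flagged in advance. A minimal one-dimensional sketch of this idea (not the paper's multivariate implementation; the names `alpha` and `eta` are illustrative, with `alpha` the contamination proportion and `eta > 1` the variance-inflation factor):

```python
import math

def normal_pdf(x, mu, sigma2):
    """Density of a univariate Gaussian with mean mu and variance sigma2."""
    return math.exp(-(x - mu) ** 2 / (2 * sigma2)) / math.sqrt(2 * math.pi * sigma2)

def contaminated_normal_pdf(x, mu, sigma2, alpha, eta):
    """Contaminated Gaussian density: a (1 - alpha)/alpha mixture of a typical
    Gaussian and one with variance inflated by eta (eta > 1)."""
    return ((1 - alpha) * normal_pdf(x, mu, sigma2)
            + alpha * normal_pdf(x, mu, eta * sigma2))

def is_atypical(x, mu, sigma2, alpha, eta):
    """Classify x as atypical when the posterior weight of the inflated
    component exceeds that of the typical one."""
    typical = (1 - alpha) * normal_pdf(x, mu, sigma2)
    atypical = alpha * normal_pdf(x, mu, eta * sigma2)
    return atypical > typical
```

In the CWM setting this construction is applied separately to the covariate distribution and to the conditional distribution of the responses given the covariates, which is what yields the four-way split into typical points, outliers, and good/bad leverage points.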



Author information

Correspondence to Antonio Punzo.


About this article


Cite this article

Punzo, A., and McNicholas, P.D. (2017), "Robust Clustering in Regression Analysis via the Contaminated Gaussian Cluster-Weighted Model", Journal of Classification, 34, 249–293, doi:10.1007/s00357-017-9234-x.


Keywords

  • Mixture models
  • Cluster-weighted models
  • Model-based clustering
  • Contaminated Gaussian distribution
  • Robust regression