Robust Clustering in Regression Analysis via the Contaminated Gaussian Cluster-Weighted Model

Journal of Classification

Abstract

The Gaussian cluster-weighted model (CWM) is a mixture of regression models with random covariates that allows for flexible clustering of a random vector composed of response variables and covariates. In each mixture component, a Gaussian distribution is adopted for both the covariates and the responses given the covariates. To make the approach robust with respect to the presence of mildly atypical observations, the contaminated Gaussian CWM is introduced. In addition to the parameters of the Gaussian CWM, each mixture component has a parameter controlling the proportion of outliers, one controlling the proportion of leverage points, one specifying the degree of contamination with respect to the response variables, and another specifying the degree of contamination with respect to the covariates. Crucially, these parameters do not have to be specified a priori, adding flexibility to the approach. Furthermore, once the model is estimated and the observations are assigned to the components, a finer intra-group classification into typical points, (mild) outliers, good leverage points, and bad leverage points (concepts of primary importance in robust regression analysis) can be directly obtained. Relations with other mixture-based contaminated models are analyzed, identifiability conditions are provided, an expectation-conditional maximization algorithm is outlined for parameter estimation, and various implementation and operational issues are discussed. Properties of the estimators of the regression coefficients are evaluated through Monte Carlo experiments and compared with other procedures. A sensitivity study is also conducted based on a real data set.
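The intra-group classification described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation, and it rests on several assumptions: a contaminated Gaussian is taken to be the two-part mixture alpha * N(mean, cov) + (1 - alpha) * N(mean, eta * cov) with eta > 1, so that 1 - alpha plays the role of the proportion of atypical points and eta the degree of contamination; the component parameters are treated as already estimated; and a point is called atypical when its posterior probability of coming from the non-inflated Gaussian falls below 0.5. All function names, dictionary keys, and numeric values are hypothetical.

```python
import numpy as np


def gaussian_pdf(z, mean, cov):
    """Multivariate Gaussian density at each row of z (mean may vary by row)."""
    z = np.atleast_2d(z)
    diff = z - mean
    # Squared Mahalanobis distance of every row with respect to cov.
    maha = np.sum(diff * np.linalg.solve(cov, diff.T).T, axis=1)
    _, logdet = np.linalg.slogdet(cov)
    dim = z.shape[1]
    return np.exp(-0.5 * (dim * np.log(2.0 * np.pi) + logdet + maha))


def prob_typical(z, mean, cov, alpha, eta):
    """Posterior probability that each row comes from the non-inflated Gaussian."""
    good = alpha * gaussian_pdf(z, mean, cov)
    bad = (1.0 - alpha) * gaussian_pdf(z, mean, eta * cov)
    return good / (good + bad)


def classify_within_component(x, y, comp):
    """Label observations already assigned to one component (assumed rule)."""
    x = np.atleast_2d(x)
    y = np.atleast_2d(y)
    # Typicality of the covariates under their contaminated Gaussian.
    typical_x = prob_typical(
        x, comp["mean_x"], comp["cov_x"], comp["alpha_x"], comp["eta_x"]
    ) > 0.5
    # Typicality of the responses around the component's linear regression.
    fitted = comp["intercept"] + x @ comp["slope"]
    typical_y = prob_typical(
        y, fitted, comp["cov_y"], comp["alpha_y"], comp["eta_y"]
    ) > 0.5
    names = {
        (True, True): "typical",
        (True, False): "outlier",
        (False, True): "good leverage",
        (False, False): "bad leverage",
    }
    return [names[(bool(tx), bool(ty))] for tx, ty in zip(typical_x, typical_y)]


# Hypothetical parameters for one component with one covariate and one response.
component = {
    "mean_x": np.array([0.0]), "cov_x": np.array([[1.0]]),
    "alpha_x": 0.9, "eta_x": 20.0,
    "intercept": np.array([0.0]), "slope": np.array([[1.0]]),
    "cov_y": np.array([[1.0]]), "alpha_y": 0.9, "eta_y": 20.0,
}
x_obs = np.array([[0.0], [0.0], [8.0], [8.0]])
y_obs = np.array([[0.0], [8.0], [8.0], [0.0]])
labels = classify_within_component(x_obs, y_obs, component)
print(labels)
```

With these illustrative values the four points are labelled typical, outlier, good leverage, and bad leverage, in that order: a point far from the covariate mean that still lies on the regression line is a good leverage point, while one that is far in both senses is a bad leverage point. For points very far from the component, both densities can underflow to zero, so a log-scale computation would be safer in practice.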

Author information

Correspondence to Antonio Punzo.

Cite this article

Punzo, A., McNicholas, P.D. Robust Clustering in Regression Analysis via the Contaminated Gaussian Cluster-Weighted Model. J Classif 34, 249–293 (2017). https://doi.org/10.1007/s00357-017-9234-x

  • DOI: https://doi.org/10.1007/s00357-017-9234-x
