Robust Clustering in Regression Analysis via the Contaminated Gaussian Cluster-Weighted Model

Punzo, Antonio; McNicholas, Paul D.

doi:10.1007/s00357-017-9234-x

Published: 20 June 2017

Journal of Classification, Volume 34, pages 249–293 (2017)

Antonio Punzo¹ & Paul D. McNicholas²

Abstract

The Gaussian cluster-weighted model (CWM) is a mixture of regression models with random covariates that allows for flexible clustering of a random vector composed of response variables and covariates. In each mixture component, a Gaussian distribution is adopted for both the covariates and the responses given the covariates. To make the approach robust with respect to the presence of mildly atypical observations, the contaminated Gaussian CWM is introduced. In addition to the parameters of the Gaussian CWM, each mixture component has a parameter controlling the proportion of outliers, one controlling the proportion of leverage points, one specifying the degree of contamination with respect to the response variables, and another specifying the degree of contamination with respect to the covariates. Crucially, these parameters do not have to be specified a priori, adding flexibility to the approach. Furthermore, once the model is estimated and the observations are assigned to the components, a finer intra-group classification in typical points, (mild) outliers, good leverage points, and bad leverage points—concepts of primary importance in robust regression analysis—can be directly obtained. Relations with other mixture-based contaminated models are analyzed, identifiability conditions are provided, an expectation-conditional maximization algorithm is outlined for parameter estimation, and various implementation and operational issues are discussed. Properties of the estimators of the regression coefficients are evaluated through Monte Carlo experiments and compared with other procedures. A sensitivity study is also conducted based on a real data set.
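
The intra-group classification described in the abstract can be illustrated with a minimal Python sketch. It assumes the usual two-part contaminated Gaussian density (a "good" Gaussian with weight alpha plus a variance-inflated Gaussian with weight 1 - alpha and inflation factor eta > 1), a single covariate, a single response, one already-fitted mixture component, and a 0.5 posterior-probability cutoff. All function names, dictionary keys, and parameter values are hypothetical and chosen for illustration; they are not the authors' notation or code.

```python
import math


def normal_pdf(z, mean, var):
    """Density of a univariate Gaussian with the given mean and variance."""
    return math.exp(-0.5 * (z - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)


def contaminated_pdf(z, mean, var, alpha, eta):
    """Contaminated Gaussian: weight alpha on N(mean, var), 1 - alpha on N(mean, eta * var)."""
    return alpha * normal_pdf(z, mean, var) + (1.0 - alpha) * normal_pdf(z, mean, eta * var)


def prob_typical(z, mean, var, alpha, eta):
    """Posterior probability that z was generated by the non-inflated part."""
    return alpha * normal_pdf(z, mean, var) / contaminated_pdf(z, mean, var, alpha, eta)


def classify_within_component(x, y, comp, cutoff=0.5):
    """Label (x, y) inside one component; the cutoff is an assumption of this sketch."""
    # Is the covariate value typical for this component?
    x_ok = prob_typical(x, comp["mu_x"], comp["var_x"], comp["alpha_x"], comp["eta_x"]) > cutoff
    # Is the response typical given the covariate (i.e. close to the regression line)?
    fitted = comp["b0"] + comp["b1"] * x
    y_ok = prob_typical(y, fitted, comp["var_y"], comp["alpha_y"], comp["eta_y"]) > cutoff
    if x_ok and y_ok:
        return "typical"
    if x_ok:
        return "outlier"          # atypical response only
    if y_ok:
        return "good leverage"    # atypical covariate, but on the regression line
    return "bad leverage"         # atypical covariate and off the regression line


# Hypothetical fitted component: x ~ contaminated N(0, 1), y | x ~ contaminated N(x, 1).
component = {
    "mu_x": 0.0, "var_x": 1.0, "alpha_x": 0.9, "eta_x": 10.0,
    "b0": 0.0, "b1": 1.0, "var_y": 1.0, "alpha_y": 0.9, "eta_y": 10.0,
}

for point in [(0.0, 0.0), (0.0, 6.0), (6.0, 6.0), (6.0, 0.0)]:
    print(point, classify_within_component(*point, component))
```

With these illustrative values the four points are labelled typical, outlier, good leverage, and bad leverage, in that order. In the full model the same two checks would be applied within the component to which each observation has been assigned, with every parameter estimated from the data rather than fixed in advance.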

Author information

Authors and Affiliations

Department of Economics and Business, University of Catania, Catania, Italy
Antonio Punzo
McMaster University, Hamilton, Canada
Paul D. McNicholas

Corresponding author

Correspondence to Antonio Punzo.

About this article

Cite this article

Punzo, A., McNicholas, P.D. Robust Clustering in Regression Analysis via the Contaminated Gaussian Cluster-Weighted Model. J Classif 34, 249–293 (2017). https://doi.org/10.1007/s00357-017-9234-x

Issue Date: July 2017
