Abstract
We propose a simple affine-equivariant clustering method, based on the idea of best linear classification, for samples from a mixture of two multivariate normal distributions with different mean vectors but proportional covariance matrices. To mitigate the curse of dimensionality, we present a non-parametric approach for finding candidates for a best linear discriminant function. Using simulation studies and a real example, we show that, for large samples in high dimensions, the proposed method can be a useful supplement to general-purpose multivariate outlier detection methods.
Additional information
Salem S. Reyen was supported by the Defense Advanced Research Projects Agency through cooperative agreement 8105-48267 with the Johns Hopkins University.
About this article
Cite this article
Reyen, S.S., Miller, J.J. & Wegman, E.J. Separating a mixture of two normals with proportional covariances. Metrika 70, 297–314 (2009). https://doi.org/10.1007/s00184-008-0193-4
DOI: https://doi.org/10.1007/s00184-008-0193-4