Detection of Outliers in Multivariate Data: A Method Based on Clustering and Robust Estimators
Abstract
Outlier identification is important in many applications of multivariate analysis, either because there is a specific interest in finding anomalous observations or as a pre-processing task before applying a multivariate method, in order to protect the results from the possibly harmful effects of those observations. It is also of great interest in supervised classification (or discriminant analysis) when, in predicting group membership, one wants the possibility of labelling an observation as “does not belong to any of the available groups”. The identification of outliers in multivariate data is usually based on the Mahalanobis distance. The use of robust estimates of the mean and the covariance matrix is advised in order to avoid the masking effect (Rousseeuw and Leroy, 1985; Rousseeuw and von Zomeren, 1990; Rocke and Woodruff, 1996; Becker and Gather, 1999). However, the performance of these rules is still highly dependent on multivariate normality of the bulk of the data. The aim of the method described here is to remove this dependence.
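To make the masking effect concrete, the following minimal Python sketch compares a Mahalanobis-distance rule built on the classical mean and covariance with the same rule built on robust estimates. It is an illustration under stated assumptions, not the method of this paper: the data are a made-up bivariate example, the robust estimates are a crude stand-in (coordinate-wise median and MAD, with correlation ignored) rather than a high-breakdown estimator such as those in the cited literature, and the function names are hypothetical. The cutoff is the 0.975 quantile of the chi-square distribution with 2 degrees of freedom, which has the closed form -2 ln(1 - 0.975).

```python
import math
from statistics import mean, median


def mahalanobis_sq(points, center, cov):
    """Squared Mahalanobis distances of 2-D points for a center and 2x2 covariance."""
    (a, b), (c, d) = cov
    det = a * d - b * c
    dists = []
    for x, y in points:
        dx, dy = x - center[0], y - center[1]
        # Quadratic form with the 2x2 inverse written out explicitly.
        dists.append((d * dx * dx - (b + c) * dx * dy + a * dy * dy) / det)
    return dists


def classical_estimates(points):
    """Sample mean and sample covariance matrix (not robust)."""
    n = len(points)
    mx = mean(p[0] for p in points)
    my = mean(p[1] for p in points)
    sxx = sum((x - mx) ** 2 for x, _ in points) / (n - 1)
    syy = sum((y - my) ** 2 for _, y in points) / (n - 1)
    sxy = sum((x - mx) * (y - my) for x, y in points) / (n - 1)
    return (mx, my), ((sxx, sxy), (sxy, syy))


def robust_estimates(points):
    """Assumption: crude robust stand-in (coordinate-wise median, MAD scale).

    It ignores correlation and is not affine equivariant; it only serves to
    show what robust location and scatter estimates buy in this example.
    """
    xs = [p[0] for p in points]
    ys = [p[1] for p in points]
    mx, my = median(xs), median(ys)
    # 1.4826 makes the MAD consistent for the standard deviation under normality.
    sx = 1.4826 * median(abs(v - mx) for v in xs)
    sy = 1.4826 * median(abs(v - my) for v in ys)
    return (mx, my), ((sx ** 2, 0.0), (0.0, sy ** 2))


def flag_outliers(points, center, cov, level=0.975):
    """Indices whose squared distance exceeds the chi-square(2) quantile."""
    cutoff = -2.0 * math.log(1.0 - level)  # exact quantile for 2 degrees of freedom
    d2 = mahalanobis_sq(points, center, cov)
    return [i for i, value in enumerate(d2) if value > cutoff]


# Made-up data: 25 "bulk" points on a grid plus a tight cluster of 8 outliers.
bulk = [(float(x), float(y)) for x in range(-2, 3) for y in range(-2, 3)]
outliers = [(10.0, 10.0)] * 8
data = bulk + outliers

classical_flags = flag_outliers(data, *classical_estimates(data))
robust_flags = flag_outliers(data, *robust_estimates(data))

print("classical:", classical_flags)  # nothing is flagged: the cluster is masked
print("robust:   ", robust_flags)     # indices 25..32, i.e. the planted cluster
```

In this example the outlying cluster pulls the classical mean towards itself and inflates the classical covariance along its own direction, so its distances fall below the cutoff. The chi-square cutoff itself rests on approximate normality of the bulk of the data, which is the dependence discussed above.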
Keywords
Multivariate analysis, Outlier detection, Robust estimation, Clustering, Supervised classification
References
- Banfield, J.D. and Raftery, A.E. (1992). Model-based Gaussian and non-Gaussian clustering. Biometrics, 49, 803–822.
- Becker, C. and Gather, U. (1999). The masking breakdown point of multivariate outlier identification rules. Journal of the American Statistical Association, 94, 947–955.
- Becker, C. and Gather, U. (2001). The largest nonidentifiable outlier: a comparison of multivariate simultaneous outlier identification rules. Computational Statistics and Data Analysis, 36, 119–127.
- Davies, P.L. and Gather, U. (1993). The identification of multiple outliers. Journal of the American Statistical Association, 88, 782–792.
- Kaufman, L. and Rousseeuw, P.J. (1990). Finding Groups in Data: An Introduction to Cluster Analysis. New York: Wiley.
- Kosinski, A.S. (1998). A procedure for the detection of multivariate outliers. Computational Statistics and Data Analysis, 29, 145–161.
- Rocke, D.M. and Woodruff, D.L. (1996). Identification of outliers in multivariate data. Journal of the American Statistical Association, 91, 1047–1061.
- Rousseeuw, P.J. (1985). Multivariate estimation with high breakdown point. In: Mathematical Statistics and Applications, Volume B, eds. W. Grossman, G. Pflug, I. Vincze and W. Werz, 283–297. Dordrecht: Reidel.
- Rousseeuw, P.J. and Leroy, A.M. (1985). Robust Regression and Outlier Detection. New York: Wiley.
- Rousseeuw, P.J. and von Zomeren, B.C. (1990). Unmasking multivariate outliers and leverage points. Journal of the American Statistical Association, 85, 633–639.