Compstat pp 291-296

Detection of Outliers in Multivariate Data: A Method Based on Clustering and Robust Estimators

  • Carla M. Santos-Pereira
  • Ana M. Pires

Abstract

Outlier identification is important in many applications of multivariate analysis, either because there is specific interest in finding anomalous observations or as a pre-processing task before applying some multivariate method, in order to protect the results from the possibly harmful effects of those observations. It is also of great interest in supervised classification (or discriminant analysis) if, when predicting group membership, one wants the possibility of labelling an observation as “does not belong to any of the available groups”. The identification of outliers in multivariate data is usually based on the Mahalanobis distance. The use of robust estimates of the mean and the covariance matrix is advised in order to avoid the masking effect (Rousseeuw and Leroy, 1985; Rousseeuw and von Zomeren, 1990; Rocke and Woodruff, 1996; Becker and Gather, 1999). However, the performance of these rules is still highly dependent on multivariate normality of the bulk of the data. The aim of the method described here is to remove this dependence.
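
To make the Mahalanobis-distance rule and the masking effect concrete, the following is a minimal sketch, not the method of the paper. It assumes bivariate data, so that the chi-square cutoff has a closed form, and it replaces a proper robust estimator with a crude stand-in: trim to the central half of the data, then rescale the covariance. All function names and the simulated data are hypothetical and serve only as illustration.

```python
import math
import random
import statistics


def chi2_quantile_2df(q):
    """Quantile of the chi-square distribution with 2 degrees of freedom."""
    # For 2 df the CDF is 1 - exp(-x / 2), so the quantile has a closed form.
    return -2.0 * math.log(1.0 - q)


def mean_and_cov(points):
    """Sample mean and 2x2 covariance, returned as (mx, my), (sxx, sxy, syy)."""
    n = len(points)
    mx = sum(x for x, _ in points) / n
    my = sum(y for _, y in points) / n
    sxx = sum((x - mx) ** 2 for x, _ in points) / (n - 1)
    syy = sum((y - my) ** 2 for _, y in points) / (n - 1)
    sxy = sum((x - mx) * (y - my) for x, y in points) / (n - 1)
    return (mx, my), (sxx, sxy, syy)


def squared_mahalanobis(point, center, cov):
    """Squared Mahalanobis distance, using the closed-form 2x2 inverse."""
    dx, dy = point[0] - center[0], point[1] - center[1]
    sxx, sxy, syy = cov
    det = sxx * syy - sxy ** 2
    return (syy * dx * dx - 2.0 * sxy * dx * dy + sxx * dy * dy) / det


def crude_robust_estimates(points):
    """Crude robust stand-in (an assumption of this sketch, NOT MCD or MVE)."""
    xs = [x for x, _ in points]
    ys = [y for _, y in points]
    med_x, med_y = statistics.median(xs), statistics.median(ys)
    mad_x = statistics.median([abs(x - med_x) for x in xs])
    mad_y = statistics.median([abs(y - med_y) for y in ys])

    def start_distance(p):
        # Coordinate-wise standardised distance from the median.
        return ((p[0] - med_x) / mad_x) ** 2 + ((p[1] - med_y) / mad_y) ** 2

    # Keep the h most central points and estimate mean and covariance from them.
    h = len(points) // 2 + 1
    core = sorted(points, key=start_distance)[:h]
    center, cov = mean_and_cov(core)

    # Rescale so the median squared distance matches the chi-square(2) median.
    d2 = [squared_mahalanobis(p, center, cov) for p in points]
    factor = statistics.median(d2) / chi2_quantile_2df(0.5)
    return center, tuple(factor * s for s in cov)


def flag_outliers(points, center, cov, quantile=0.975):
    """Flag points whose squared distance exceeds the chi-square(2) cutoff."""
    cutoff = chi2_quantile_2df(quantile)
    return [squared_mahalanobis(p, center, cov) > cutoff for p in points]


# Simulated data (hypothetical): 100 clean points plus 20 clustered outliers.
rng = random.Random(0)
clean = [(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(100)]
planted = [(8 + rng.gauss(0, 0.1), 8 + rng.gauss(0, 0.1)) for _ in range(20)]
data = clean + planted

classical_flags = flag_outliers(data, *mean_and_cov(data))
robust_flags = flag_outliers(data, *crude_robust_estimates(data))
print("planted outliers flagged (classical):", sum(classical_flags[100:]))
print("planted outliers flagged (robust):   ", sum(robust_flags[100:]))
```

In this setup the clustered outliers pull the classical mean towards themselves and inflate the classical covariance, which is the masking effect the abstract refers to. Both rules still rely on a chi-square cutoff, that is, on approximate normality of the bulk of the data, which is the dependence the paper aims to remove.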

Keywords

Multivariate analysis Outlier detection Robust estimation Clustering Supervised classification 

References

  1. Banfield, J.D. and Raftery, A.E. (1992). Model-based Gaussian and non-Gaussian clustering. Biometrics, 49, 803–822.
  2. Becker, C. and Gather, U. (1999). The masking breakdown point of multivariate outlier identification rules. Journal of the American Statistical Association, 94, 947–955.
  3. Becker, C. and Gather, U. (2001). The largest nonidentifiable outlier: a comparison of multivariate simultaneous outlier identification rules. Computational Statistics and Data Analysis, 36, 119–127.
  4. Davies, P.L. and Gather, U. (1993). The identification of multiple outliers. Journal of the American Statistical Association, 88, 782–792.
  5. Kaufman, L. and Rousseeuw, P.J. (1990). Finding Groups in Data: An Introduction to Cluster Analysis. New York: Wiley.
  6. Kosinski, A.S. (1998). A procedure for the detection of multivariate outliers. Computational Statistics and Data Analysis, 29, 145–161.
  7. Rocke, D.M. and Woodruff, D.L. (1996). Identification of outliers in multivariate data. Journal of the American Statistical Association, 91, 1047–1061.
  8. Rousseeuw, P.J. (1985). Multivariate estimation with high breakdown point. In: Mathematical Statistics and Applications, Volume B, eds. W. Grossman, G. Pflug, I. Vincze and W. Werz, 283–297. Dordrecht: Reidel.
  9. Rousseeuw, P.J. and Leroy, A.M. (1985). Robust Regression and Outlier Detection. New York: Wiley.
  10. Rousseeuw, P.J. and von Zomeren, B.C. (1990). Unmasking multivariate outliers and leverage points. Journal of the American Statistical Association, 85, 633–639.

Copyright information

© Springer-Verlag Berlin Heidelberg 2002

Authors and Affiliations

  • Carla M. Santos-Pereira (1)
  • Ana M. Pires (2)
  1. Universidade Portucalense Infante D. Henrique, Oporto, Portugal and Applied Mathematics Centre, IST, Technical University of Lisbon, Portugal
  2. Department of Mathematics and Applied Mathematics Centre, IST, Technical University of Lisbon, Portugal
