Detection of multivariate outliers in business survey data with incomplete information

Todorov, Valentin; Templ, Matthias; Filzmoser, Peter

doi:10.1007/s11634-010-0075-2

Regular Article
Published: 27 October 2010

Volume 5, pages 37–56, (2011)
Advances in Data Analysis and Classification

Valentin Todorov¹,
Matthias Templ²,³ &
Peter Filzmoser³

Abstract

Many methods for statistical data editing can be found in the literature, but only a few of them are based on robust estimates (for example BACON-EEM, the epidemic algorithm (EA) and the transformed rank correlation (TRC) methods of Béguin and Hulliger). However, we can show that outlier detection is only reasonable if robust methods are applied, because the classical estimates are themselves influenced by the outliers. Nevertheless, data editing is essential to check the multivariate data for possible data problems, and it is not deterministic like traditional micro editing, where all records are extensively edited manually using certain rules and constraints. The presence of missing values is more the rule than the exception in business surveys and poses additional severe challenges to outlier detection. First we review the available multivariate outlier detection methods that can cope with incomplete data. In a simulation study, in which a subset of the Austrian Structural Business Statistics is simulated, we compare several approaches. Robust methods based on the Minimum Covariance Determinant (MCD) estimator, S-estimators and the OGK estimator, as well as BACON-EEM, provide the best results in finding the outliers and in providing a low false discovery rate. Many of the discussed methods are implemented in the R package rrcovNA, which is available from the Comprehensive R Archive Network (CRAN) at http://www.CRAN.R-project.org under the GNU General Public License.
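
The claim that classical estimates are themselves influenced by the outliers can be illustrated with a small sketch. It is not the authors' method and does not use rrcovNA: it is a toy example in plain Python, on complete bivariate data only (missing values, the actual topic of the article, are not handled). The data, the function names, the coverage h = 9 of 12 records and the chi-square 0.975 cutoff are all assumptions made for illustration. The robust fit is a brute-force raw MCD (the h-subset with the smallest covariance determinant), without the consistency correction or reweighting step that production implementations apply.

```python
import itertools
import math


def mean_cov(points):
    """Sample mean and 2x2 sample covariance (denominator n - 1)."""
    n = len(points)
    mx = sum(p[0] for p in points) / n
    my = sum(p[1] for p in points) / n
    sxx = sum((p[0] - mx) ** 2 for p in points) / (n - 1)
    syy = sum((p[1] - my) ** 2 for p in points) / (n - 1)
    sxy = sum((p[0] - mx) * (p[1] - my) for p in points) / (n - 1)
    return (mx, my), (sxx, sxy, syy)


def det(cov):
    """Determinant of a symmetric 2x2 matrix stored as (sxx, sxy, syy)."""
    sxx, sxy, syy = cov
    return sxx * syy - sxy ** 2


def sq_mahalanobis(p, center, cov):
    """Squared Mahalanobis distance of a 2-D point from (center, cov)."""
    sxx, sxy, syy = cov
    dx, dy = p[0] - center[0], p[1] - center[1]
    return (syy * dx * dx - 2 * sxy * dx * dy + sxx * dy * dy) / det(cov)


def raw_mcd(points, h):
    """Exact raw MCD by exhaustive search; only feasible for tiny samples."""
    best = min(itertools.combinations(points, h),
               key=lambda subset: det(mean_cov(subset)[1]))
    return mean_cov(best)


def flag(points, center, cov, cutoff):
    """Flag every point whose squared distance exceeds the cutoff."""
    return [sq_mahalanobis(p, center, cov) > cutoff for p in points]


def rates(flags, is_outlier):
    """Detection rate and false discovery rate of a set of flags."""
    hits = sum(f and o for f, o in zip(flags, is_outlier))
    flagged = sum(flags)
    detection = hits / sum(is_outlier)
    fdr = (flagged - hits) / flagged if flagged else 0.0
    return detection, fdr


# Hypothetical toy data: nine records close to the line y = 2x, plus three
# identical erroneous records that look unremarkable in each variable alone.
noise = [0.3, -0.2, 0.1, -0.3, 0.2, 0.0, -0.1, 0.3, -0.3]
clean = [(float(x), 2.0 * x + e) for x, e in zip(range(1, 10), noise)]
outliers = [(8.0, 4.0)] * 3
data = clean + outliers
is_outlier = [False] * 9 + [True] * 3

# 0.975 quantile of the chi-square distribution with 2 degrees of freedom,
# which has the closed form -2 * ln(1 - p).
cutoff = -2.0 * math.log(1.0 - 0.975)

classical = flag(data, *mean_cov(data), cutoff)
robust = flag(data, *raw_mcd(data, h=9), cutoff)
classical_rates = rates(classical, is_outlier)
robust_rates = rates(robust, is_outlier)
print("classical (detection rate, FDR):", classical_rates)
print("robust    (detection rate, FDR):", robust_rates)
```

On this toy data the three planted records inflate the classical covariance enough that their own distances stay below the cutoff (masking), while distances based on the MCD subset separate them clearly. The exhaustive search is used only because it is easy to verify; real implementations rely on approximate algorithms and must also cope with missing values.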

References

Barnett V, Lewis T (1994) Outliers in statistical data. Wiley, New York
Béguin C, Hulliger B (2004) Multivariate outlier detection in incomplete survey data: the epidemic algorithm and transformed rank correlations. J R Stat Soc Ser A (Stat Soc) 167(2): 275–294
Béguin C, Hulliger B (2008) The BACON-EEM algorithm for multivariate outlier detection in incomplete survey data. Surv Methodol 34(1): 91–103
Billor N, Hadi AS, Velleman PF (2000) BACON: blocked adaptive computationally efficient outlier nominators. Comput Stat Data Anal 34(3): 279–298
Campbell NA (1989) Bushfire mapping using NOAA AVHRR data. Technical report, CSIRO
Cerioli A, Riani M, Atkinson AC (2009) Controlling the size of multivariate outlier tests with the MCD estimator of scatter. Stat Comput 19(3): 341–353
Chambers RL (1986) Outlier robust finite population estimation. J Am Stat Assoc 81: 1063–1069
Copt S, Victoria-Feser MP (2004) Fast algorithms for computing high breakdown covariance matrices with missing data. In: Hubert M, Pison G, Struyf A, Van Aelst S (eds) Theory and applications of recent robust methods, statistics for industry and technology series. Birkhäuser, Basel
Croux C, Haesbroeck G (1999) Influence function and efficiency of the minimum covariance determinant scatter matrix estimator. J Multivar Anal 71: 161–190
De Waal T (2003) Processing of erroneous and unsafe data. PhD thesis, Erasmus University, Rotterdam
De Waal T (2009) Statistical data editing. In: Pfeffermann D, Rao C (eds) Handbook of statistics 29A. Sample surveys: design, methods and applications. Elsevier B. V., Amsterdam, pp 187–214
Dempster AP, Laird NM, Rubin DB (1977) Maximum likelihood from incomplete data via the EM algorithm. J R Stat Soc Ser B (Stat Methodol) 39: 1–22
Dinges G, Haitzmann M (2009) Modellbasierte Ergänzung der Konjunkturstatistik im Produzierenden Bereich; Darstellung der statistischen Grundgesamtheit im Produzierenden Bereich. Statistische Nachrichten 9: 1153–1166. http://www.stat.at/web_de/downloads/methodik/kjp.pdf
Donoho DL (1982) Breakdown properties of multivariate location estimators. Technical report, Harvard University, Boston. http://www-stat.stanford.edu/~donoho/Reports/Oldies/BPMLE.pdf
EUREDIT Project (2004) Towards effective statistical editing and imputation strategies: findings of the Euredit project, vols 1 and 2. EUREDIT consortium. http://www.cs.york.ac.uk/euredit/results/results.html
Eurostat (2008) NACE Rev. 2. Statistical classification of economic activities in the European community. Eurostat, methodologies and working papers, ISBN 978-92-79-04741-1
Fellegi I, Holt D (1976) A systematic approach to automatic edit and imputation. J Am Stat Assoc 71: 17–35
Filzmoser P, Garrett RG, Reimann C (2005) Multivariate outlier detection in exploration geochemistry. Comput Geosci 31: 579–587
Filzmoser P, Maronna R, Werner M (2008) Outlier identification in high dimensions. Comput Stat Data Anal 52(3): 1694–1711
Franklin S, Brodeur M (1997) A practical application of a robust multivariate outlier detection method. In: Proceedings of the survey research methods section. American Statistical Association, pp 186–191. http://www.amstat.org/sections/srms/proceedings
Franklin S, Brodeur M, Thomas S (2000) Robust multivariate outlier detection using Mahalanobis’ distance and Stahel–Donoho estimators. In: ICES II, international conference on establishment surveys II
Granquist L (1990) A review of some macro-editing methods for rationalizing the editing process. In: Proceedings of the statistics Canada symposium, Ottawa, Canada, pp 225–234
Granquist L (1997) The new view on editing. Int Stat Rev 65: 381–387
Hampel FR, Ronchetti EM, Rousseeuw PJ, Stahel WA (1986) Robust statistics, the approach based on influence functions. Wiley, New York
Hardin J, Rocke DM (2005) The distribution of robust distances. J Comput Graph Stat 14: 910–927
Hidiroglou MA, Lavallée P (2009) Sampling and estimation in business surveys. In: Pfeffermann D, Rao C (eds) Handbook of statistics 29A, sample surveys: design, methods and applications. Elsevier B. V., Amsterdam, pp 441–470
Huber PJ (1981) Robust statistics. Wiley, New York
Hubert M, Rousseeuw PJ, Vanden Branden K (2005) ROBPCA: a new approach to robust principal component analysis. Technometrics 47: 64–79
Hubert M, Rousseeuw PJ, van Aelst S (2008) High-breakdown robust multivariate methods. Stat Sci 23: 92–119
Johnson RA, Wichern DW (2002) Applied multivariate statistical analysis. 5th edn. Prentice Hall, New Jersey
Lawrence D, McKenzie R (2000) The general application of significance editing. J Official Stat 16: 243–253
Little RJA, Rubin DB (1987) Statistical analysis with missing data. Wiley, New York
Little RJA, Smith PJ (1987) Editing and imputation for quantitative data. J Am Stat Assoc 82: 58–69
Lopuhaä HP (1999) Asymptotics of reweighted estimators of multivariate location and scatter. Ann Stat 27: 1638–1665
Lopuhaä HP, Rousseeuw PJ (1991) Breakdown points of affine equivariant estimators of multivariate location and covariance matrices. Ann Stat 19: 229–248
Luzi O, De Waal T, Hulliger B, Di Zio M, Pannekoek J, Kilchmann D, Guarnera U, Hoogland J, Manzari A, Tempelman C (2007) Recommended practices for editing and imputation in cross-sectional business surveys. Report
Maronna RA, Yohai VJ (1995) The behaviour of the Stahel-Donoho robust multivariate estimator. J Am Stat Assoc 90: 330–341
Maronna RA, Zamar RH (2002) Robust estimation of location and dispersion for high-dimensional datasets. Technometrics 44: 307–317
Maronna RA, Martin D, Yohai V (2006) Robust statistics: theory and methods. Wiley, New York
R Development Core Team (2009) R: a language and environment for statistical computing. R Foundation for Statistical Computing, Vienna. http://www.R-project.org/, ISBN 3-900051-07-0
Riani M, Atkinson AC, Cerioli A (2009) Finding an unknown number of multivariate outliers. J R Stat Soc Ser B (Stat Methodol) 71(2): 447–466
Rousseeuw PJ, Leroy AM (1987) Robust regression and outlier detection. Wiley, New York
Rousseeuw PJ, van Zomeren BC (1990) Unmasking multivariate outliers and leverage points. J Am Stat Assoc 85: 633–651
Rubin DB (1993) Discussion: statistical disclosure limitation. J Official Stat 9: 462–468
Schafer J (1997) Analysis of incomplete multivariate data. Chapman and Hall, London
Schafer JL, Graham JW (2002) Missing data: our view of the state of the art. Psychol Methods 7: 147–177
Stahel WA (1981a) Breakdown of covariance estimators. Research Report 31, ETH Zurich, Fachgruppe für Statistik
Stahel WA (1981b) Robuste Schätzungen: Infinitesimale Optimalität und Schätzungen von Kovarianzmatrizen. PhD thesis no. 6881, Swiss Federal Institute of Technology (ETH), Zürich. http://www.e-collection.ethbib.ethz.ch/view/eth:21890
Templ M, Filzmoser P (2008) Visualization of missing values using the R-package VIM. Research report cs-2008-1, Department of Statistics and Probability Theory, Vienna University of Technology, Vienna
Todorov V, Filzmoser P (2009) An object oriented framework for robust multivariate analysis. J Stat Softw 32(3): 1–47. http://www.jstatsoft.org/v32/i03/
Vanden Branden K, Verboven S (2009) Robust data imputation. Comput Biol Chem 33(1): 7–13
Venables WN, Ripley BD (2003) Modern applied statistics with S. 4th edn. Springer, Berlin
Verboven S, Vanden Branden K, Goos P (2007) Sequential imputation for missing values. Comput Biol Chem 31(5–6): 320–327
Wegman E (1990) Hyperdimensional data analysis using parallel coordinates. J Am Stat Assoc 85: 664–675

Author information

Authors and Affiliations

United Nations Industrial Development Organization (UNIDO), Vienna International Centre, P.O. Box 300, 1400, Vienna, Austria
Valentin Todorov
Department of Methodology, Statistics Austria, Vienna University of Technology, Vienna, Austria
Matthias Templ
Department of Statistics and Probability Theory, Vienna University of Technology, Wiedner Hauptstr. 8-10, 1040, Vienna, Austria
Matthias Templ & Peter Filzmoser

Corresponding author

Correspondence to Valentin Todorov.

Additional information

The views expressed herein are those of the authors and do not necessarily reflect the views of the United Nations Industrial Development Organization.

About this article

Cite this article

Todorov, V., Templ, M. & Filzmoser, P. Detection of multivariate outliers in business survey data with incomplete information. Adv Data Anal Classif 5, 37–56 (2011). https://doi.org/10.1007/s11634-010-0075-2

Received: 05 February 2010
Revised: 21 August 2010
Accepted: 27 August 2010
Published: 27 October 2010
Issue Date: April 2011
DOI: https://doi.org/10.1007/s11634-010-0075-2
