Abstract
A new method for identifying multiple outliers in linear regression models is developed. It is relatively simple and easy to use. The method is based on a result giving asymptotic properties of extreme studentized residuals, which is proved under rather general conditions on the estimation procedure and the covariate distribution. An extensive simulation study shows that the proposed method outperforms various existing methods in terms of masking and swamping values. The advantage of the method is particularly visible for large datasets and/or large numbers of outliers. The analysis of several well-known real data examples confirms that in most cases the new method identifies outliers better than other commonly used methods.
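For readers unfamiliar with the quantity the abstract refers to, the following is a minimal sketch of how internally studentized residuals of an ordinary least squares fit can be computed and ranked. It is not the authors' identification procedure: the abstract does not state the asymptotic result or any thresholds, so none are reproduced here. The function name `studentized_residuals`, the simulated data, and the planted contamination are all assumptions made for illustration.

```python
import numpy as np

def studentized_residuals(X, y):
    """Internally studentized residuals of an OLS fit (intercept added)."""
    n = X.shape[0]
    A = np.column_stack([np.ones(n), X])          # design matrix with intercept
    p = A.shape[1]
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)  # least squares coefficients
    resid = y - A @ beta
    # Leverages are the diagonal of the hat matrix; with A = QR they equal
    # the squared row norms of Q.
    Q, _ = np.linalg.qr(A)
    h = np.sum(Q**2, axis=1)
    s2 = resid @ resid / (n - p)                  # residual variance estimate
    return resid / np.sqrt(s2 * (1.0 - h))

# Hypothetical data: 200 observations, two covariates, three planted outliers.
rng = np.random.default_rng(0)
n = 200
x = rng.normal(size=(n, 2))
y = 1.0 + x @ np.array([2.0, -1.0]) + rng.normal(scale=0.5, size=n)
y[:3] += 10.0                                     # assumed gross contamination

r = studentized_residuals(x, y)
largest = np.argsort(-np.abs(r))[:3]              # indices of the most extreme residuals
```

Ranking by absolute studentized residual only shows which observations are most extreme. Deciding how many of them are outliers requires a calibrated cutoff, which is what the asymptotic result described in the abstract is meant to supply.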
Acknowledgements
The authors are thankful to the Editor-in-Chief, the Associate Editor and the anonymous Reviewers for their valuable constructive comments, which led to an improved version of the manuscript.
Cite this article
Bagdonavičius, V., Petkevičius, L. A new multiple outliers identification method in linear regression. Metrika 83, 275–296 (2020). https://doi.org/10.1007/s00184-019-00731-8
DOI: https://doi.org/10.1007/s00184-019-00731-8