A new multiple outliers identification method in linear regression

Abstract

A new method for identifying multiple outliers in linear regression models is developed. It is relatively simple and easy to use. The method is based on a result giving the asymptotic properties of extreme studentized residuals, which is proved under rather general conditions on the estimation procedure and the covariate distribution. An extensive simulation study shows that the proposed method outperforms various existing methods in terms of masking and swamping values. The advantage of the method is particularly visible for large datasets and (or) large numbers of outliers. The analysis of several well-known real data examples confirms that in most cases the new method identifies outliers better than other commonly used methods.
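To make the central quantity concrete, the following minimal sketch computes internally studentized residuals from an ordinary least squares fit and lists the observations with the most extreme values. It is not the procedure proposed in the paper: the function names `studentized_residuals` and `flag_extreme`, the fixed cutoff of 3.0, the use of least squares, and the simulated data are all illustrative assumptions.

```python
import numpy as np


def studentized_residuals(X, y):
    """Internally studentized residuals of an ordinary least squares fit.

    X is an (n, p) design matrix that already contains an intercept column.
    """
    n, p = X.shape
    # Least squares coefficients and raw residuals.
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    # Leverages: diagonal of the hat matrix, from a thin QR factorization.
    q, _ = np.linalg.qr(X)
    leverage = np.sum(q**2, axis=1)
    # Residual variance estimate with n - p degrees of freedom.
    sigma2 = resid @ resid / (n - p)
    return resid / np.sqrt(sigma2 * (1.0 - leverage))


def flag_extreme(X, y, cutoff=3.0):
    """Indices whose |studentized residual| exceeds an illustrative cutoff.

    The cutoff is an assumption for this sketch, not the paper's threshold.
    """
    r = studentized_residuals(X, y)
    return np.flatnonzero(np.abs(r) > cutoff)


# Simulated data (hypothetical): a straight line with three planted outliers.
rng = np.random.default_rng(0)
n = 200
x = rng.normal(size=n)
X = np.column_stack([np.ones(n), x])
y = 1.0 + 2.0 * x + rng.normal(scale=0.5, size=n)
y[:3] += 10.0  # shift the first three responses far from the line
flagged = flag_extreme(X, y)
print(flagged)
```

With a non-robust fit such as this one, the planted outliers inflate the residual variance estimate and shrink every studentized residual, which is one way masking can arise when many outliers are present.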

Notes

  1. http://rosa.unipr.it/FSDA/guide.html

Acknowledgements

The authors are thankful to the Editor-in-Chief, the Associate Editor and the anonymous Reviewers for their valuable constructive comments, which led to an improved version of the manuscript.

Author information

Corresponding author

Correspondence to Vilijandas Bagdonavičius.

About this article

Cite this article

Bagdonavičius, V., Petkevičius, L. A new multiple outliers identification method in linear regression. Metrika 83, 275–296 (2020). https://doi.org/10.1007/s00184-019-00731-8

Keywords

  • Outlier identification
  • Linear regression
  • Multiple outliers
  • Outlier region
  • Robust estimators