A new multiple outliers identification method in linear regression

Metrika

Abstract

A new method for identifying multiple outliers in linear regression models is developed. It is relatively simple and easy to use. The method is based on a result giving asymptotic properties of extreme studentized residuals. This result is proved under rather general conditions on the estimation procedure and the covariate distribution. An extensive simulation study shows that the proposed method outperforms various existing methods in terms of masking and swamping values. The advantage of the method is particularly visible for large datasets and/or large numbers of outliers. The analysis of several well-known real data examples confirms that in most cases the new method identifies outliers better than other commonly used methods.
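To make the quantities named in the abstract concrete, the following is a minimal sketch of internally studentized residuals from an ordinary least squares fit, together with masking and swamping rates. It is not the authors' method: the function names, the fixed cutoff of 3, and the definitions of masking (share of true outliers missed) and swamping (share of regular points flagged) are assumptions made for illustration only.

```python
import numpy as np

def studentized_residuals(X, y):
    """Internally studentized OLS residuals r_i = e_i / (s * sqrt(1 - h_ii))."""
    X = np.column_stack([np.ones(len(y)), X])      # add an intercept column
    n, p = X.shape
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)   # least squares fit
    resid = y - X @ beta
    Q, _ = np.linalg.qr(X)                         # thin QR of the design matrix
    h = np.sum(Q**2, axis=1)                       # leverages: diagonal of the hat matrix
    s = np.sqrt(resid @ resid / (n - p))           # residual standard error
    return resid / (s * np.sqrt(1.0 - h))

def masking_swamping(flagged, true_outliers, n):
    """Assumed definitions: masking = missed outliers / outliers,
    swamping = flagged regular points / regular points."""
    flagged, true_outliers = set(flagged), set(true_outliers)
    masking = len(true_outliers - flagged) / len(true_outliers)
    swamping = len(flagged - true_outliers) / (n - len(true_outliers))
    return masking, swamping

# Synthetic illustration (hypothetical data, not from the article).
rng = np.random.default_rng(0)
n = 100
x = rng.normal(size=n)
y = 1.0 + 2.0 * x + 0.5 * rng.normal(size=n)
outlier_idx = list(range(5))
y[outlier_idx] += 10.0                             # shift five responses upward

r = studentized_residuals(x, y)
flagged = np.flatnonzero(np.abs(r) > 3.0)          # ad hoc cutoff, not the paper's rule
masking, swamping = masking_swamping(flagged, outlier_idx, n)
print(f"masking={masking:.2f}, swamping={swamping:.2f}")
```

A fixed cutoff applied to least squares residuals is known to suffer from masking when several outliers inflate the scale estimate, which is the kind of failure the abstract says the proposed method is designed to reduce.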

Notes

  1. http://rosa.unipr.it/FSDA/guide.html.

Acknowledgements

The authors are thankful to the Editor-in-Chief, the Associate Editor and the anonymous Reviewers for their valuable constructive comments, which led to an improved version of the manuscript.

Author information

Corresponding author

Correspondence to Vilijandas Bagdonavičius.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Electronic supplementary material

Below is the link to the electronic supplementary material.

Supplementary material 1 (pdf 206 KB)

About this article

Cite this article

Bagdonavičius, V., Petkevičius, L. A new multiple outliers identification method in linear regression. Metrika 83, 275–296 (2020). https://doi.org/10.1007/s00184-019-00731-8

  • DOI: https://doi.org/10.1007/s00184-019-00731-8
