Check your outliers! An introduction to identifying statistical outliers in R with easystats

  • Original Manuscript
  • Behavior Research Methods

Abstract

Beyond the challenge of keeping up to date with current best practices regarding the diagnosis and treatment of outliers, an additional difficulty arises concerning the mathematical implementation of the recommended methods. Here, we provide an overview of current recommendations and best practices and demonstrate how they can easily and conveniently be implemented in the R statistical computing software, using the {performance} package of the easystats ecosystem. We cover univariate, multivariate, and model-based statistical outlier detection methods, their recommended thresholds, standard output, and plotting methods. We conclude by reviewing the different theoretical types of outliers, whether to exclude or winsorize them, and the importance of transparency. A preprint of this paper is available at: 10.31234/osf.io/bu6nt.
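As a rough illustration of the three families of methods named above, the following sketch calls check_outliers() from {performance} on the mtcars data that ships with R. The choice of columns, the regression formula, and the object names are assumptions made for this example, not the analyses reported in the paper. The checks only run if the package is installed.

```r
# Illustrative sketch; column choices and object names are assumptions.
has_performance <- requireNamespace("performance", quietly = TRUE)

if (has_performance) {
  # Univariate: robust z-scores on a single numeric variable
  uni <- performance::check_outliers(mtcars["mpg"], method = "zscore_robust")
  print(uni)

  # Multivariate: minimum covariance determinant (relies on the MASS package)
  if (requireNamespace("MASS", quietly = TRUE)) {
    multi <- performance::check_outliers(mtcars[c("mpg", "hp")], method = "mcd")
    print(multi)
  }

  # Model-based: Cook's distance on a fitted linear model
  fit <- stats::lm(mpg ~ hp + wt, data = mtcars)
  model_based <- performance::check_outliers(fit, method = "cook")
  print(model_based)
}
```

Each call uses the package's default threshold for the chosen method; a different cutoff can be supplied through the threshold argument.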

Data availability

This paper first appeared as a preprint (https://doi.org/10.31234/osf.io/bu6nt) and is also available as an online vignette at https://easystats.github.io/performance/articles/check_outliers. All data used in this paper are included with base R.

Code availability

The performance package is available on the package's official website (https://easystats.github.io/performance), on CRAN (https://cran.r-project.org/package=performance), and on the R-Universe (https://easystats.r-universe.dev/performance). The source code is available on GitHub (https://github.com/easystats/performance/), and the package can be installed from CRAN with install.packages("performance"). The code to reproduce the figures and all analyses in this paper is available at https://osf.io/eqja6/.

Notes

  1. Note that check_outliers() only checks numeric variables.

  2. 3.29 is an approximation of the two-tailed critical value for p < .001, obtained through qnorm(p = 1 - 0.001 / 2). We chose this threshold for consistency with the thresholds of all our other methods.

  3. Note that univariate outlier detection methods might not be the optimal way of treating reaction time outliers (Ratcliff, 1993; Van Zandt & Ratcliff, 1995).

  4. Our default threshold for the MCD method is defined by stats::qchisq(p = 1 - 0.001, df = ncol(x)), which again is an approximation of the critical value for p < .001 consistent with the thresholds of our other methods.

  5. Our default threshold for the Cook method is defined by stats::qf(0.5, ncol(x), nrow(x) - ncol(x)). Unlike the thresholds of our other methods, this is not a critical value for p < .001: the value 0.5 selects the median of the implied F distribution for D, which allows us to flag D values that lie above that median.

  6. Some authors provide much more detailed classifications of outliers; for example, see Table 1 in Aguinis et al. (2013), for 14 different outlier definitions based on a literature review.
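The threshold expressions quoted in Notes 2, 4, and 5 can be evaluated directly in base R. In this minimal sketch, x is a hypothetical stand-in: three numeric columns of mtcars chosen only so that ncol(x) and nrow(x) are defined. The variable names are likewise assumptions for illustration.

```r
# Note 2: two-tailed critical value of the standard normal for p < .001
z_threshold <- stats::qnorm(p = 1 - 0.001 / 2)
print(round(z_threshold, 2))

# Hypothetical example data (an assumption, used only to supply ncol/nrow)
x <- mtcars[, c("mpg", "hp", "wt")]

# Note 4: chi-squared quantile with one degree of freedom per column
mcd_threshold <- stats::qchisq(p = 1 - 0.001, df = ncol(x))
print(mcd_threshold)

# Note 5: median (0.5 quantile) of the F distribution implied for Cook's D
cook_threshold <- stats::qf(0.5, ncol(x), nrow(x) - ncol(x))
print(cook_threshold)
```

Because the second and third expressions depend on the dimensions of x, these thresholds change with the number of variables and observations, whereas the z-score threshold is constant.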

Acknowledgements

{performance} is part of the collaborative easystats ecosystem (Lüdecke et al., 2023). Thus, we thank all members of easystats, contributors, and users alike.

Funding

This research received no external funding.

Author information

Contributions

Writing - original draft preparation: RT. Writing - reviewing and editing, software: RT, MSB-S, IP, DL, BMW, and DM.

Corresponding author

Correspondence to Rémi Thériault.

Ethics declarations

Competing interests

The authors declare no conflict of interest.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Thériault, R., Ben-Shachar, M.S., Patil, I. et al. Check your outliers! An introduction to identifying statistical outliers in R with easystats. Behav Res (2024). https://doi.org/10.3758/s13428-024-02356-w

  • DOI: https://doi.org/10.3758/s13428-024-02356-w
