A fundamental component of the scientific method is statistical rigor, which has been defined as the “consistency in conceptual development, epistemological stance, application of analytical tools and transparent reporting of their use, and subsequent interpretation and reporting of findings” (Köhler et al., 2017, p. 713). Maintaining statistical rigor ensures the appropriate use of statistical analyses that uphold the scientific method and contribute credible results to the scientific literature. Among the most common threats to statistical rigor are the inappropriate use of statistical analyses (Dar et al., 1994; García-Pérez, 2012; Schatz et al., 2005) and the misreporting and misinterpretation of results (Bakker & Wicherts, 2011; Berle & Starcevic, 2007).

The use of inappropriate statistical tests leads to unreliable measurement and threatens statistical conclusion validity, a special form of internal validity that concerns sources of random error and the appropriate use of statistics and statistical tests (Cook & Campbell, 1979; García-Pérez, 2012). Inappropriate test usage often occurs when an analysis is unable to produce a logical or reliable answer to the research question (García-Pérez, 2012). This is commonly seen when traditional, non-robust procedures (e.g., t-test, analysis of variance, regression) are applied despite violations of assumptions such as normality and homogeneity of variance (Hoekstra et al., 2012; Keselman et al., 1998; Osborne, 2008). As a result, type I and type II error rates are uncontrolled and effect sizes may be over- or underestimated (Osborne & Waters, 2002), ultimately leading to unreliable results and invalid conclusions. This is a noteworthy problem within the psychological science literature, where only roughly 8% of published studies mention checking parametric assumptions at all (Hoekstra et al., 2012; Keselman et al., 1998; Osborne, 2008).

Reliability is necessary, but not sufficient, for validity (Fig. 1). Replication failure is inevitable in the presence of reporting error, which has been shown to occur at relatively high rates in the psychological literature (Bakker & Wicherts, 2011; Berle & Starcevic, 2007; Caperos & Pardo Merino, 2013; Nuijten et al., 2017). While reporting error often results from human error, a lack of knowledge of statistical procedures and a lack of clarity in the output generated by statistical software are also major contributors to the misreporting and misinterpretation of findings. The principles of reliability and validity highlight the importance of reproducing results across various instruments, as one instrument may produce invalid results just as reliably as another produces valid results. This also applies to the reliability of the tools used to conduct statistical analyses, specifically the output of various statistical software packages commonly used within a field of research. Researchers rely on statistical software to produce reliable and accurate results, particularly when large datasets or complex statistical models are used; however, researchers have found differences in computational speed (McCoach et al., 2018), accuracy (McCullough, 2000; McCullough & Heiser, 2008), and quantitative results (Bergmann et al., 2000; Grieder & Steiner, 2020; Keeling & Pavur, 2007) when the same data are analyzed by various statistical software packages. Specifically, inconsistencies in algorithmic and computational procedures, rounding, and the reporting of test statistics and p-values have been found across various statistical software packages (Bergmann et al., 2000). Such discrepancies may contribute to reporting error (Bakker & Wicherts, 2011; Levine & Atkin, 2004); hence, the statistical software used for data analysis entails a researcher degree of freedom that must be considered when replication attempts are made.

Fig. 1

Reliability is necessary but not sufficient for validity. Note. Validity is not present when results are unreliable (left) and may not be present when results are reliable but produced through the same, improper method (may be the result of direct replication; middle). Validity is only supported by reliable results that are obtained through adequate methods (right).

Computational reproducibility can be further complicated by factors such as version control, ambiguous dependencies and programming, and human error in the reporting of results. Although it is becoming increasingly common for scientists to host data, scripts, and results on open-source repositories, problems may still arise in replication. For instance, failure to document file dependencies and lack of clarity in file execution order may make it difficult to replicate results exactly. Even with documentation and code provided, efforts to replicate results can often fail. One recent investigation found that, out of a sample of 2000 studies with accompanying R code, 74% failed to complete without error (Trisovic et al., 2022). Third-party services (such as GitHub and Docker) offer version control and containerization to address these issues; however, researcher knowledge of these services may be limited, as most scientific training does not cover their use (Peikert & Brandmaier, 2021).

In this study, we investigate the reliability of four statistical software programs commonly used in the field of psychology (SPSS, SAS, Stata, and R). To our knowledge, no study to date has examined the reliability of nonparametric statistical output across statistical software programs. We chose the following five nonparametric statistical tests for our primary analysis due to their common use in quantitative research or their equivalence to commonly used parametric procedures (Table 1): Pearson’s chi-square test of independence, Spearman’s rank-ordered correlation, the Kruskal–Wallis one-way analysis of variance, the Wilcoxon Mann–Whitney U rank-sum test, and the Wilcoxon signed-rank test. We also tested the reliability of estimates of skewness and kurtosis produced by the four software programs, as these estimates are often used by researchers when determining whether nonparametric analysis is appropriate.

Table 1 Descriptions of the nonparametric tests included in the present analyses

The decision to focus on nonparametric procedures was determined through the consensus of all authors for the following reasons: (1) Nonparametric analyses are commonly used in the literature to compare sample characteristics between participant groups (e.g., Pearson’s chi-square test of independence), although other nonparametric analyses appear less frequently. (2) While hand- and computer-based calculations of common parametric analyses are typically covered in undergraduate and graduate statistics courses, nonparametric analyses are rarely covered to the same degree, especially in required statistics courses in the behavioral sciences (Alder & Vollick, 2000; Friedrich et al., 2018). This may lead to an increased reliance on statistical software for computing such analyses and a decreased likelihood that errors in computation will be noticed or that variations in default algorithms used by different software platforms will be known. (3) Other researchers have recently published material comparing statistical software programs for common parametric analyses as well as more complex statistical models (Bergmann et al., 2000; Brown et al., 2012; McCoach et al., 2018; Oster & Hilbe, 2008a, 2008b; Wang & Johnson, 2019). (4) The authors felt confident in their own understanding and in the capacity of the software programs used in the present study to perform the chosen nonparametric analyses.

Nonparametric inference

The use of nonparametric procedures provides a way to combat many of the threats to statistical rigor described above. There are several advantages to nonparametric methods, namely that they are distribution-free and thus require few assumptions about the underlying populations from which the data are obtained. Because they rely on so few assumptions, nonparametric methods protect researchers from the false conclusions that can result from misleading significance values obtained when parametric procedures are performed in the presence of violated assumptions (Potvin & Roff, 1993). Disadvantages to using nonparametric methods exist as well, and the greatest of these is that they are geared toward hypothesis testing rather than effect estimation. Although it is possible to obtain nonparametric estimates of effect and associated confidence intervals, the process by which this is done is generally not straightforward (Whitley & Ball, 2002). For example, while critical value tables for probability distributions used for parametric analyses (e.g., z, t, F) are commonly provided in textbooks and online resources, such resources are scarce for the majority of nonparametric statistics. Furthermore, statistical software is often limited in its ability to perform nonparametric procedures.

In contrast to parametric analyses, a large number of algorithmic variations exist for the way in which rank-based nonparametric methods can be implemented, and statistical packages are not consistent in their application of such procedures. The algorithms used to compute test statistics for rank- and frequency-based nonparametric tests vary in one or more of the following three ways: (1) whether the exact null distribution or an asymptotic, large-sample approximation of the normal (z) or χ2 distribution is used; and in the latter case, whether corrections for (2) continuity and/or (3) tied ranks are applied (Lehmann, 1998; Siegel & Castellan, 1988).
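To make these three choices concrete, the sketch below (our illustration, not drawn from any package’s documentation) shows how each source of variation surfaces as an explicit argument in R’s base stats functions; the data and variable names are hypothetical.

```r
# A minimal sketch using base R ("stats" only); data and names are hypothetical.
set.seed(1)
g <- factor(rep(c("A", "B"), each = 15))   # hypothetical grouping variable
y <- round(rnorm(30), 1)                   # hypothetical outcome, rounded to induce ties

# (1) Exact vs. asymptotic null distribution
wilcox.test(y ~ g, exact = TRUE)    # request the exact distribution (falls back, with a
                                    # warning, to the normal approximation if ties exist)
wilcox.test(y ~ g, exact = FALSE)   # force the large-sample (z) approximation

# (2) Continuity correction applied to the asymptotic approximation
wilcox.test(y ~ g, exact = FALSE, correct = FALSE)

# (3) Tied ranks: average ("mid") ranks are assigned automatically
rank(y)[1:10]
```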

Exact versus asymptotic distributions

Exact p-values and confidence intervals are calculated using the true underlying null distribution, which is discrete in most nonparametric inference; however, due to computational inefficiency and a lack of null distribution tables for samples of approximately n > 30, an approximation of the true distribution is typically used to calculate asymptotic probabilities for large samples. Asymptotic distributions are approximations of the true underlying distribution and rely on the central limit theorem, assuming that the sample is large enough for the test statistic to approach the normal distribution. Parameters that do not assume a normal distribution, such as those derived from rank-ordered nonparametric tests, are often estimated using asymptotic distribution-free methods that evaluate the median, rather than the mean (Huang & Bentler, 2015; Neave & Worthington, 1988). There is some controversy over the appropriateness of asymptotic versus exact p-values for statistical inference with contingency tables (García-Pérez & Núñez-Antón, 2020; Lydersen et al., 2009; Prescott, 2019); however, while exact and asymptotic probabilities obtained from nonparametric analysis are generally very similar when obtained from large samples, they can be quite different when sample sizes are small. In such situations, asymptotic p-values may lead to unreliable and misleading conclusions if used inappropriately. For example, in a systematic examination of differences between asymptotic and exact probabilities extracted from various nonparametric tests performed on small samples, Mundry and Fischer (1997) found that, relative to exact statistics, asymptotic procedures led to an increase in type I error rates when used for Wilcoxon signed-rank and Wilcoxon Mann–Whitney U tests and an increase in type II error rates when used to compute Spearman’s rank-ordered correlation coefficient.
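As a small illustration of how far the two can diverge, the following sketch (hypothetical data with no ties, base R only) compares exact and asymptotic p-values from the same Wilcoxon Mann–Whitney test on a small sample.

```r
# Hypothetical small-sample data with no ties
x <- c(1.2, 2.4, 3.1, 4.8, 5.5, 6.9, 7.3, 8.0)
y <- c(2.1, 3.6, 4.2, 5.9, 6.4, 7.8, 8.5, 9.1)

wilcox.test(x, y, exact = TRUE)$p.value                    # exact null distribution
wilcox.test(x, y, exact = FALSE, correct = FALSE)$p.value  # asymptotic, uncorrected
wilcox.test(x, y, exact = FALSE, correct = TRUE)$p.value   # asymptotic, continuity-corrected
```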

Continuity correction

When asymptotic distributions are continuous (e.g., normal, χ2), a continuity correction (the addition or subtraction of 0.5 to or from a discrete value) may be applied to improve approximations of the underlying discrete null distribution of nonparametric procedures (Gibbons & Chakraborti, 2011). The appropriateness of continuity correction procedures is also controversial, however, as some argue that they are only appropriate for one-sided tests (Haber, 1982; Mantel, 1976). Some commonly used continuity correction procedures, such as Yates’ correction for the chi-square test (Yates, 1934), are systematically conservative, resulting in overcorrected (larger) p-values when applied to two-sided tests (Maxwell, 1976; Stefanescu et al., 2005). Arguments against the use of a continuity correction for chi-square tests of independence in 2 × 2 tables date back to Pearson (1947), who stated that “… it becomes clear that in the case of small samples, at any rate, this method of introducing the normal approximation gives such an overestimate of the true chances of falling beyond a contour [of p = .05 or .01] as to be almost valueless” (p. 155). In support of this, simulation studies have demonstrated that the application of Yates’ continuity correction to Pearson’s chi-square tests performed on medium to small samples overcorrects the probability of an outcome, resulting in below-nominal type I error rates and a substantial loss of power (Campbell, 2007; Garside & Mack, 1976; Grizzle, 1967; Richardson, 1990). Despite this, current textbooks and software packages are inconsistent in their recommendation and default use of continuity correction procedures (Hitchcock, 2009).
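Base R’s chisq.test(), for example, applies Yates’ correction to 2 × 2 tables by default, and the correction must be switched off explicitly to obtain the uncorrected Pearson statistic; the table below is hypothetical and serves only to show the two calls side by side.

```r
# Hypothetical 2 x 2 contingency table
tbl <- matrix(c(12, 18, 25, 15), nrow = 2,
              dimnames = list(group = c("A", "B"), outcome = c("yes", "no")))

chisq.test(tbl)                   # Yates-corrected chi-square (R's default for 2 x 2 tables)
chisq.test(tbl, correct = FALSE)  # uncorrected Pearson chi-square
fisher.test(tbl)                  # exact test, for comparison
```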

Correction for ties

Rank-ordered statistics are typically assumed to be drawn from a continuous population, in which the probability of any two observations being exactly equal in magnitude is zero (Gibbons & Chakraborti, 2011; Siegel, 1957). In practice, two or more observations of the same magnitude commonly occur due to measurement imprecision or because the population distribution is actually discrete. Such observations are considered to be tied, and some method of assigning ranks to tied values must be applied so that test statistics that depend on the relative magnitudes of observations (i.e., rank-order statistics) can be computed (Gibbons & Chakraborti, 2011). Most commonly, each observation within a group of tied observations is assigned the average of the ranks that the group would otherwise occupy. Although this method does not affect the mean ranked value, it reduces the variance in the ranks. This affects the underlying null distribution; thus, a correction for ties is typically applied to the calculation of the test statistic.
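A minimal sketch of mid-ranking and of the standard tie-correction factor for the Kruskal–Wallis H statistic, C_H = 1 − Σ(t³ − t)/(N³ − N) where t is the size of each group of tied values, is shown below; the data are hypothetical, and the exact formulas used for the present analyses are those given in the Supplemental Material.

```r
x <- c(3, 5, 5, 7, 7, 7, 9)          # hypothetical data with two groups of ties
rank(x)                              # tied observations receive the average of their ranks

t_sizes <- as.numeric(table(x))      # size of each group of (tied or untied) values
N <- length(x)
C_H <- 1 - sum(t_sizes^3 - t_sizes) / (N^3 - N)
C_H                                  # the uncorrected H statistic is divided by this factor
```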

The appropriateness of applying any of the algorithmic variations described above depends on the characteristics of the data. Typically, the default algorithms used within a given statistical software package depend on the sample size and the presence or absence of tied ranks. As a result, a single software package may apply different procedures to different datasets without user intervention. In addition, the criteria used to determine when exact or asymptotic distributions are used and whether corrections are applied differ between software packages. To further complicate this, multiple algorithms exist for determining exact statistics, and various continuity corrections have been suggested for a given nonparametric test; often, the algorithms and correction procedures used are simply the choice of the software developer rather than that which is most appropriate for the data. When the default algorithmic specifications are not made clear in statistical output, non-statisticians are unlikely to know whether correction procedures were applied to asymptotic p-values or whether the probability statistics reported are exact. As a consequence, inappropriate procedures may be used, and statistical errors or misleading results may be reported. Furthermore, because default algorithms vary across statistical packages, the results reported by researchers who are naïve to the specific algorithms used to compute them might have differed if another statistical package had been used; thus, the conclusions made from statistical analyses conducted under such conditions may be mere reflections of the software used rather than the actual data analyzed.

Transparency in research

To address recent concerns over replication (Open Science Collaboration, 2015; 2012), the scientific community is calling for transparency in research through open science, which may aid in both reducing the number of false positives published in the literature and increasing reproducibility and replicability by encouraging ethical research practices (Ioannidis, 2014). Proponents of open science advise researchers to document and share a detailed account of their study procedures (Borghi & Van Gulick, 2018), make their data available to the public, and preregister all studies. Some have also argued that researchers should document all researcher degrees of freedom (Wicherts et al., 2016), or choices that researchers make when designing a study and collecting, analyzing, and reporting the data (Simmons et al., 2011); this should include a report of the statistical software (and version) used and documentation of the command syntax used for statistical analysis. Researcher degrees of freedom may increase type I error rates, inflate effect sizes, misrepresent the scientific process by undermining the hypothetico-deductive model of the scientific method (Fig. 2), and lead to statistical findings that are difficult to reproduce in the same data or replicate in new data (Asendorpf et al., 2013; Ioannidis, 2005; Wicherts et al., 2016). Documenting researcher degrees of freedom may make it possible to distinguish between true and artifactual replication failures by showing that significant findings occur only when another researcher makes the exact same methodological decisions (Epskamp, 2019; Wicherts et al., 2016).
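In R, for instance, the software version and attached package versions used for an analysis can be documented with a few base calls; the sketch below is our own illustration, and the output file name is an arbitrary choice.

```r
R.version.string   # the exact R version used
sessionInfo()      # platform, locale, and attached package versions
citation()         # how to cite R itself

# Saving this alongside the analysis scripts keeps the record with the results
writeLines(capture.output(sessionInfo()), "session_info.txt")
```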

Fig. 2

Researcher degrees of freedom compromise the hypothetico-deductive model of the scientific method. Note. There are several opportunities for researcher degrees of freedom (outside, in red) to impede the scientific method, including a failure to control for bias, lack of replication, low statistical power, poor quality control, lack of documentation, analytical flexibility, inconsistencies in algorithms used across statistical software (particularly relevant to the present study), selective reporting, statistical fallacies, publication bias, and lack of data sharing, among others. These factors work together to undermine the robustness of published research. Adapted from The Seven Deadly Sins of Psychology: A Manifesto for Reforming the Culture of Scientific Practice, by C. Chambers, 2019, Princeton University Press, p. 17 and “A Manifesto for Reproducible Science,” by Munafò et al., 2017

Rationale and objective

We see a gap in the current literature, as there are few published studies that methodically document the differences produced between statistical software programs. Although some previous work has demonstrated formulaic and computational differences in the calculation of certain parametric and nonparametric methods (Bergmann et al., 2000; Brown et al., 2012; McCoach et al., 2018; Oster & Hilbe, 2008a, 2008b; Wang & Johnson, 2019), this work is largely geared toward mathematicians and statisticians, making it difficult to generalize to the various subfields within the social and behavioral sciences. We believe that such statistical discrepancies, where they exist, constitute a larger and currently overlooked problem in the field; hence, the present methodological investigation seeks to systematically document the reliability of statistical output generated from five nonparametric analyses and two measures of normality across four statistical software packages commonly used in the social and behavioral sciences: SPSS, SAS, Stata, and R.

We conducted this investigation using a large sample (n = 1000) drawn from a public dataset that is widely used in the social psychology literature. All analyses were conducted using the default parameters of each software package and as few additional specifications/options as possible, so as to reduce the number of researcher degrees of freedom involved in our analysis. We hypothesized that the results of all analyses would be largely consistent across the statistical platforms, although minor discrepancies were expected at the decimal point level due to rounding error. If any larger discrepancies in the results were found, they were expected to occur due to differences in the default algorithms used across the software packages.

Additional investigations

Upon seeing the results of our primary analysis, we questioned whether the same findings would be observed in smaller samples that more accurately reflect those seen in the majority of the psychological science literature. We performed an informal systematic review of the recent literature to determine the appropriate sample sizes to use in these post hoc investigations. The PsycINFO database was searched on November 9, 2020, for peer-reviewed, English-language articles published in scientific journals within the last 5 years that employed one or more of the nonparametric analyses investigated herein using the following terms (search criteria were applied within the full text of the article): “Chi-square” OR “Spearman” OR “Kruskal–Wallis” OR “Wilcoxon” OR “Mann–Whitney.” Published study protocols, conference abstracts, meta-analyses, review articles, and mathematical models were excluded. The search results included a total of 792 publications with linked full text available to PsycINFO subscribers. The abstract and methods sections were examined, and the full-text article was retrieved for the first 100 studies (sorted by relevance) that met our inclusion criteria. Data pertaining to sample size, nonparametric test(s) performed, and statistical software package used were extracted from each article (Supplemental Table S1).

The most recent 100 published studies within the psychological science literature that used nonparametric analyses reported results for sample sizes ranging from 9 to 2991 (M = 377.64, SD = 636.92, median = 165.50). Based on these data, our large-N sample size of n = 1000 was as large as or larger than approximately 90% of the samples represented in these 100 studies. To generalize the results to more commonly reported sample sizes within the field, we replicated our analysis on medium (n = 200) and small (n = 100) samples of nonoverlapping data taken from the same public dataset used for the large-N analysis. The medium and small sample sizes chosen approximately reflect the 60th and 30th percentiles, respectively, of those represented in the recent literature. It is important to note that, because the decision to conduct the medium-N and small-N analyses was made after the results of the initial, large-N analysis were known, these post hoc investigations were conducted for exploratory purposes only. Specifically, we explored the extent to which the inconsistencies found in the results of the large-N analysis were present in the results of the medium-N and small-N analyses.
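A minimal sketch of the percentile logic follows, assuming a vector sample_sizes held the 100 extracted sample sizes; the values shown here are placeholders, not the data reported in Supplemental Table S1.

```r
sample_sizes <- c(9, 48, 102, 165, 230, 410, 880, 2991)   # placeholder values only
quantile(sample_sizes, probs = c(0.30, 0.60, 0.90))
# n = 100, 200, and 1000 were chosen to sit near the 30th, 60th,
# and ~90th percentiles of the reviewed distribution, respectively.
```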

Finally, to determine the extent to which any discrepancies found occurred consistently across various samples of data, we performed our analyses on thirty simulated datasets based on the properties of our original small-, medium-, and large-N datasets for confirmatory purposes. The original and simulated datasets are publicly available and can be found in our Open Science Framework (OSF) storage directory at https://osf.io/35umb/.

Method

Dataset

The open-source Race Implicit Association Test (IAT) 2018 data (Xu et al., 2014) were used for all of the present analyses. The Race IAT was designed to assess underlying attitudes toward white and black people that participants may be unwilling or unable to identify within themselves (Lane et al., 2007). Project Implicit hosts the Race IAT data for public use (https://osf.io/gwofk/) and allows researchers free access to the data. We chose this dataset due to its widespread use within the social and behavioral sciences (e.g., 14,200 results were returned for a single Google Scholar search for “Implicit Association Test” AND “Race” on November 8, 2020).

We selected eight variables (Table 2) from the Race IAT Public 2018 dataset based on the relevance of their scale of measurement and number of levels for nonparametric analyses, the fewest number of missing data points, and the likelihood of being used in the experimental research published in the literature. It is important to emphasize that the variables used in this work were not chosen for their underlying measured constructs, and thus the results of our analyses are not meant for conceptual interpretation; rather, the variables used in the present study are solely meant for the quantitative comparison of results obtained across statistical software packages.

Table 2 Variables selected from the Race IAT Public 2018 dataset for use in the present analyses

The raw Race IAT Public 2018 dataset consists of 460 variables and 859,470 observations. Data cleaning was performed using Stata version 16.1 (StataCorp LLC, College Station, TX). We first dropped all observations from respondents who did not complete the Race IAT (session_status != “C”), along with 1290 observations from respondents who failed to complete all 120 trials across the four combined-task blocks (N_3467 < 120). Excel was then used to identify and remove 59,530 of the remaining 414,167 observations that contained missing data. From the remaining 354,637 observations, the small-, medium-, and large-N datasets were generated by randomly selecting 100, 200, and 1000 nonoverlapping observations, respectively, using Excel’s RAND function. Observations randomly assigned numbers 1–1000, 1001–1200, and 1201–1300 were assigned to the large-, medium-, and small-N datasets, respectively.
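For readers who work in R rather than Stata and Excel, an equivalent cleaning sequence might look like the sketch below. The data frame iat, its file path, the seed, and the object selected_vars (the eight study variables) are assumptions for illustration only; session_status and N_3467 are the variables named above.

```r
# A minimal base-R analogue of the cleaning steps described above (illustration only)
# iat <- read.csv("race_iat_public_2018.csv")     # hypothetical path to the raw data

iat <- subset(iat, session_status == "C")          # completed sessions only
iat <- subset(iat, N_3467 >= 120)                  # completed all 120 combined-task trials
iat <- iat[complete.cases(iat[, selected_vars]), ] # drop rows missing any of the 8 variables

set.seed(1)                                        # hypothetical seed, for illustration
idx <- sample(nrow(iat))                           # random, nonoverlapping assignment
large_n  <- iat[idx[1:1000], ]
medium_n <- iat[idx[1001:1200], ]
small_n  <- iat[idx[1201:1300], ]
```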

Simulations

After obtaining preliminary results from the primary and exploratory analyses, the decision was made to conduct a simulation study to estimate the relative consistency of computational errors and algorithmic variations that were uncovered. Ten simulated datasets were generated for each of the three original datasets, resulting in a total of thirty simulations. Data were generated in MS Excel for Mac version 16.4 using the random number generator function and based on the properties of the original datasets (Supplemental Table S2). Five of the eight variables (sex, race, black, white, order) follow a discrete distribution, and the remaining three variables (latency, discrim1, discrim2) follow a normal distribution, for which a seed of 1234 was used. The simulated data mirror the descriptive properties of the full dataset. All simulated datasets and tables with results for each dataset are publicly available in our OSF repository (https://osf.io/35umb/).
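An analogous simulation could be scripted in R as sketched below; the authors generated their simulated data in Excel, so this is purely illustrative, and the variable levels and distribution parameters shown are placeholders rather than the values in Supplemental Table S2.

```r
set.seed(1234)                 # the seed reported above for the normally distributed variables
n <- 1000
sim <- data.frame(
  sex      = sample(1:2, n, replace = TRUE),        # discrete variables (placeholder levels)
  order    = sample(1:2, n, replace = TRUE),
  latency  = rnorm(n, mean = 800, sd = 150),        # normally distributed variables
  discrim1 = rnorm(n, mean = 0,   sd = 1)           # (placeholder means and SDs)
)
summary(sim)
```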

Present analyses

We compared the results of five commonly used nonparametric tests: Pearson’s chi-square test of independence, Spearman’s rank-ordered correlation (rho), the Kruskal–Wallis one-way analysis of variance, the Wilcoxon Mann–Whitney U rank-sum test, and the Wilcoxon signed-rank test. Each of these nonparametric tests is described in detail in Table 1. Because it is often necessary to evaluate normality when deciding whether parametric tests are inappropriate, we also compared calculations of skewness (the degree to which a set of data is symmetrically or asymmetrically distributed around the mean) and kurtosis (the extent to which the peakedness of a probability distribution deviates from the shape of a normal distribution) across software platforms.

Although we selected these nonparametric analyses to test in the current study, we recommend that any analyses conducted within MS Excel, SPSS, SAS, Stata, or R be scrutinized and checked across two or more software packages. Researchers who use one software package exclusively for their analyses may find it beneficial to replicate their findings in different software packages, especially when the underlying algorithmic procedures vary across packages.

Documentation of procedure

The statistical software packages used include the following: SPSS 27 (IBM Corporation, LLC, Armonk, NY, 2020), SAS JMP Pro 15.0 (running SAS v 9.4; SAS Institute Inc., Cary, NC, 2019), Stata 16.1 (StataCorp LLC, College Station, TX, 2020), and R 4.0.3 (The R Foundation, 2020). Although R provides many libraries that we could have utilized in our investigation, the purpose of this project was to evaluate the reliability of nonparametric tests using the fewest researcher degrees of freedom; thus, for the current study, we only used the “stats” package, which is 1 of 14 pre-installed packages in R (v 4.0.3). It is possible, and indeed likely, that the same analyses performed with different R packages would produce meaningfully different results; however, that examination is beyond the scope of the current paper. All code used to generate results in command-line-based programs (Stata, R) was documented, and in the case of programs that rely primarily on a graphical user interface (SPSS, SAS JMP Pro), the underlying command syntax generated from the point-and-click commands was also documented. The primary and simulated datasets, the scripts or command syntax used for all analyses (including the MS Excel spreadsheets used for hand calculations, which are described below), all relevant help files/documentation provided by each software package, and logs of the full statistical output generated from each software platform can be found in our OSF storage directory at https://osf.io/35umb/.

Hand calculation procedures

Prior to comparing the results across packages, we calculated each test statistic and asymptotic p-value with and without corrections for continuity and ties (when applicable) using the raw computational formulas (see Supplemental Material), and these hand-calculated results were treated as the ground truth against which all results generated by the statistical packages were compared. Due to the large size of the dataset used, literal hand calculations were not possible. Rather, we performed these calculations in MS Excel for Mac version 16.4 (Microsoft Corporation, Redmond, WA, 2020) and henceforth refer to these results as those derived by hand calculation. All hand calculations were checked for accuracy independently by two or more researchers with adequate knowledge of the relevant statistical procedures and extensive experience using the formula builder function in MS Excel.

Hand calculations are useful to perform when the results of the same analyses conflict across two or more statistical software packages, as hand calculations provide a standard against which to compare the discrepant results produced by the statistical software. The benefit of hand calculation is that researchers may more easily determine where the statistical software programs diverge, and which one is providing the output intended by the researcher. This approach is most valuable when documentation is ambiguous or missing (e.g., in some R libraries). If documentation is present and thorough across programs, comparing the differences in documentation may be easier than performing hand calculations.

The following procedures were involved in the hand calculations for each statistical test (refer to the Supplemental Material for additional detail):

  • Pearson’s chi-square test of independence: To obtain asymptotic test statistics, 2 × 2 observed and expected frequency tables were generated, and the resulting values were used to obtain the degrees of freedom, χ2 test statistic, and a right-tailed, asymptotic p-value with and without Yates’ continuity correction.

  • Spearman’s rank-ordered correlation: Spearman’s rank correlation coefficient (rho) and a two-tailed, asymptotic p-value were generated after applying a correction for ties. A table of tied ranks was also generated, where unique values and the corresponding number of ties are provided for each variable.

  • Kruskal–Wallis one-way analysis of variance: To obtain asymptotic test statistics, a frequency table including the rank sums and average squared rank sums was generated. Additionally, the degrees of freedom, Kruskal–Wallis H test statistics (following the χ2 distribution), and right-tailed, asymptotic p-values with and without a correction for ties were calculated. To determine the tie correction value (CH), a second table was generated, which includes a list of all unique values with the corresponding number of ties, as well as a range of the number of tied values present in each set of tied values.

  • Wilcoxon Mann–Whitney U rank-sum test: To obtain asymptotic test statistics, a frequency table was generated, which reports the number of total observations and rank sums for each level of the grouping variable, as well as the mean and variance of the Wilcoxon W statistic and the mean and standard deviation of the rank sums for the smaller of the two groups. Additionally, a second table was generated to aid in the tie correction, which includes a list of all unique values with the corresponding number of ties. Finally, the following test statistics were calculated after correcting for ties, both with and without application of a continuity correction: Wilcoxon W, Mann–Whitney U, z-statistic, and a two-tailed, asymptotic p-value.

  • Wilcoxon signed-rank test: To obtain asymptotic test statistics, we generated a frequency table including the number of observations and rank sums for the positive and negative ranks, as well as the number of zero differences, the number of ties, and the mean and standard deviation of the Wilcoxon T statistic (based on the positive ranks). Due to a lack of tied ranks, no table of tied values was produced for this analysis. Finally, the following test statistics were calculated both with and without application of a continuity correction: Wilcoxon T, z-statistic, and a two-tailed, asymptotic p-value.

  • Skewness and kurtosis: Computational formulas were used to determine the population skewness (Cramér, 1946), sample skewness (Bliss, 1967), population kurtosis, where the mean kurtosis of the normal distribution is equal to three (Bock, 1975), and excess sample kurtosis, where the kurtosis of the normal distribution is corrected to have a mean equal to zero (Cramér, 1946); a reference parameterization of these estimators is sketched immediately after this list.
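For reference, one common parameterization consistent with these four descriptions is given below; the precise formulas used for the hand calculations are those in the Supplemental Material.

```latex
\[
m_k = \frac{1}{n}\sum_{i=1}^{n}(x_i - \bar{x})^k, \qquad
g_1 = \frac{m_3}{m_2^{3/2}}, \qquad
G_1 = g_1\,\frac{\sqrt{n(n-1)}}{n-2},
\]
\[
b_2 = \frac{m_4}{m_2^{2}}, \qquad
G_2 = \frac{(n-1)\bigl[(n+1)(b_2 - 3) + 6\bigr]}{(n-2)(n-3)},
\]
```

where g1 and b2 are the population (moment-based) skewness and kurtosis, b2 equals 3 for a normal distribution, and G1 and G2 are the bias-adjusted sample skewness and excess sample kurtosis.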

Criteria for meaningful differences in results

We used the following criteria to determine what would be considered “meaningful” differences in results produced across packages: inconsistencies due to (1) algorithmic variation (e.g., no correction, adjustment for ties, continuity correction), (2) computational error (i.e., results differ from those obtained through hand calculation despite the use of the same computational procedures described in the software documentation), or (3) statistical output (e.g., reporting of exact versus asymptotic p-values; reporting a p-value without the test statistic or vice versa). The latter criteria were included due to the assumption that the same statistics should be reported across software programs, and that any variation in the calculation of a test statistic or p-value should be made explicitly clear in the output and provided alongside traditional (i.e., uncorrected) calculations. Finally, we selected these criteria because they have the potential to affect statistical conclusion validity. We recommend that researchers who examine their results across statistical software platforms assess whether any differences they find fall into these categories.

Results

The default procedures and algorithms used by SPSS, SAS, Stata, and R to conduct Pearson’s chi-square test of independence, Spearman’s rank correlation coefficient (rho), Kruskal–Wallis one-way analysis of variance, Wilcoxon Mann–Whitney U rank-sum test, Wilcoxon signed-rank test, skewness, and kurtosis are reported in Table 3. Additionally, Table 3 includes details regarding the hypothesis tested, according to the respective package’s help files/documentation, the statistical output generated from each analysis, and the basic command syntax with additional options or arguments that must be specified by the user (i.e., are not applied by default) in order to generate specific results (e.g., test statistic, exact p-values) or to use alternative algorithms (e.g., apply a continuity correction). Greater detail regarding the default procedures and algorithms used by each package can be found in the Supplemental Material and in the help files/documentation for each software package; we have extracted the relevant documentation for the present analyses, and it is included in our OSF storage repository at https://osf.io/35umb/.

Table 3 Default statistical procedures employed by each software package

The numerical results of all large-, medium-, and small-N analyses are reported in Tables 4, 5, and 6, respectively, for comparison across all statistical packages with the hand-calculated results. The results in each table are organized according to the default algorithms used by a given package (i.e., asymptotic test statistics with or without correction for ties and/or continuity). Exact p-values were not generated by hand for any analysis.

Table 4 Results obtained across packages compared with hand calculations for the large-N analyses
Table 5 Results obtained across packages compared with hand calculations for the medium-N analyses
Table 6 Results obtained across packages compared with hand calculations for the small-N analyses

Algorithmic variation

The default algorithms used to produce the results of all analyses differ widely across the four packages we compared.

Pearson’s chi-square

When conducting Pearson’s chi-square test, SAS and Stata apply no continuity correction, R applies Yates’ continuity correction, and SPSS generates results both with and without Yates’ continuity correction. As a result, the single p-value produced by R is larger than the single p-value produced by SAS and Stata by .049, .073, and .154 for the large-, medium-, and small-N analyses, respectively. Additionally, Fisher’s exact one- and two-tailed p-values are provided by default by SPSS for all data and by SAS for 2 × 2 contingency tables. With the exception of minor rounding discrepancies, the exact two-tailed p-values were equal between SPSS and SAS across the large-, medium-, and small-N analyses.

Spearman’s rho

All four packages adjust for ties, by default, when computing Spearman’s rho; however, if no ties are present and the sample size is less than 1290, R will produce an exact p-value in place of the asymptotic p-value.
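A minimal sketch of how a user could align R’s Spearman output with the other packages is shown below; the data are hypothetical and continuous, so no ties are present.

```r
set.seed(2)
x <- rnorm(40)
y <- x + rnorm(40)                                   # continuous values, so no ties

cor.test(x, y, method = "spearman")                  # exact p-value (R's default: n < 1290, no ties)
cor.test(x, y, method = "spearman", exact = FALSE)   # asymptotic p-value, for comparison
```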

Kruskal–Wallis

Although no ties were present in the variables assessed by the Kruskal–Wallis test, R does not adjust for ties by default, whereas results are adjusted for ties, by default, in SPSS and SAS, and Stata produces results both with and without an adjustment for ties.

Wilcoxon Mann–Whitney U

When performing the Wilcoxon Mann–Whitney U test, all packages adjust for ties; however, SAS and R also apply a continuity correction, by default, resulting in p-values that are larger by values of .001 and .002 than those which are produced by SPSS and Stata for the medium- and small-N analyses, respectively. Additionally, the formulas used to calculate the Wilcoxon Mann–Whitney U test statistics are not consistent across packages. In Stata and R, Wilcoxon’s W is calculated according to the method originally defined by Wilcoxon (1945), where W is the sum of ranks for the first sample (determined by the values that define each group in ascending order), whereas Wilcoxon’s W is defined by SPSS as the sum of ranks for the second sample and by SAS as the sum of ranks for the sample of smaller size (if sample sizes are equal, the sum of ranks for the second sample is used). As a result, SPSS produced different values for W and U, along with an inverted z-statistic across the large-, medium-, and small-N analyses, and SAS produced a different value for W and an inverted z-statistic for the large- and small-N analyses. Finally, Stata also produces an exact p-value for samples less than or equal to 200 using a recursive algorithm defined by Hill and Peto (1971). No other packages produced exact p-values for the Wilcoxon Mann–Whitney U; however, it is of note that for sample sizes less than 50, R will produce an exact p-value, calculated using the algorithm described by Bauer (1972) with a Hodges and Lehmann (1963) estimation, in place of the asymptotic p-value.
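The continuity-correction difference can be reproduced from within R by overriding the default; a minimal sketch with hypothetical data follows, where the uncorrected call corresponds to the SPSS/Stata defaults described above.

```r
set.seed(3)
g <- factor(rep(c("grp1", "grp2"), each = 60))
y <- c(rnorm(60, mean = 0), rnorm(60, mean = 0.3))   # hypothetical outcome, no ties

wilcox.test(y ~ g, exact = FALSE, correct = TRUE)    # R's default here: corrected asymptotic p-value
wilcox.test(y ~ g, exact = FALSE, correct = FALSE)   # uncorrected asymptotic (z-based) p-value
```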

Wilcoxon signed-rank

Similar to the procedures applied to the Wilcoxon Mann–Whitney U test, all packages adjust for ties when performing the Wilcoxon signed-rank test, but R is the only package to also apply a continuity correction by default. This did not affect the p-values produced by R to a meaningful degree; the only discrepancy was observed for the p-value derived from the small-N analysis, which was smaller by a value of .001 in R relative to the values produced by SPSS, SAS, and Stata. Additionally, the algorithm used to calculate the Wilcoxon T statistic in SAS differs from the generally accepted definition provided by Wilcoxon (1945), in which the test statistic is the smaller of either the sum of ranks with positive differences or the sum of ranks with negative differences. This definition is used by SPSS, Stata, and R; however, the test statistic produced by SAS is based on the sum of the absolute value of signed ranks.
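The labeling issue can be checked directly: the statistic R reports as “V” is the sum of the ranks of the positive differences, which can be reproduced by hand as sketched below with hypothetical paired data.

```r
set.seed(4)
pre  <- rnorm(25, mean = 10, sd = 2)
post <- pre + rnorm(25, mean = 0.5, sd = 1)

fit <- wilcox.test(pre, post, paired = TRUE, exact = FALSE, correct = FALSE)
fit$statistic                        # the value R labels "V"

d <- pre - post
d <- d[d != 0]                       # zero differences are discarded
sum(rank(abs(d))[d > 0])             # sum of positive-difference ranks reproduces V
```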

Skewness and kurtosis

Finally, unbiased calculations of sample skewness and excess sample kurtosis (where the expected value for kurtosis of a sample with a normal distribution is 0) are only used by SPSS and SAS. Despite the bias inherent in the use of population estimates, Stata does not provide any other built-in option for the way in which these measures of normality are estimated for a sample of data. Additionally, it is important to note that R has no built-in functionality for calculating skewness or kurtosis.
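Because base R offers no built-in skewness or kurtosis functions, users must either compute them directly or rely on add-on packages. Below is a minimal base-R sketch of bias-adjusted sample estimators consistent with those described above; it is our illustration, not code taken from any package.

```r
sample_skew_kurt <- function(x) {
  n  <- length(x)
  m2 <- mean((x - mean(x))^2)
  m3 <- mean((x - mean(x))^3)
  m4 <- mean((x - mean(x))^4)
  g1 <- m3 / m2^1.5                            # population skewness
  b2 <- m4 / m2^2                              # population kurtosis (3 for a normal distribution)
  G1 <- g1 * sqrt(n * (n - 1)) / (n - 2)       # sample skewness
  G2 <- ((n + 1) * (b2 - 3) + 6) * (n - 1) / ((n - 2) * (n - 3))  # excess sample kurtosis
  c(skewness = G1, excess_kurtosis = G2)
}

set.seed(5)
sample_skew_kurt(rexp(200))                    # a right-skewed, heavy-tailed example
```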

Simulations

The results of the simulation studies shed further light on the impact of algorithmic variations on the results of nonparametric analyses between software packages (tables with the simulations results can be found in our OSF storage at https://osf.io/35umb/). Of particular importance is the extent to which the p-values differ between uncorrected and continuity-corrected p-values produced for Pearson’s chi-square. Across all 30 simulations, the difference in continuity-corrected versus uncorrected p-values ranged from .005 to .162 (M = .082, SD = .050). These differences were more extensive for the smaller samples (Fig. 3), where the results for the small-N simulations differed by an average of .123 (SD = .048), and those of the large-N simulations differed by an average of .033 (SD = .018). While continuity-corrected p-values were the same as uncorrected p-values produced for the Wilcoxon tests in the large-N simulations, continuity-corrected p-values for the Wilcoxon Mann–Whitney U test were greater by a value of .001 on 7 of the 10 medium-N simulations, and by an average value of .002 (SD = .001) for all of the small-N simulations.

Fig. 3

Average difference in Yates’ continuity-corrected versus uncorrected asymptotic p-values for Pearson’s chi-square test across large, medium, and small sample sizes. Note. Large n = 1000; medium n = 200; small n = 100; average values are based on differences (continuity-corrected − uncorrected) in two-tailed p-values across primary and simulated datasets, with the standard deviation of the differences reflected in the error bars.

Computational error

Few inconsistencies due to computational error were observed between the hand-calculated results and those produced by the statistical packages, and all appeared in the results from the simulated datasets. These inconsistencies could not be reduced to rounding error when the same computational procedures were used for calculation (according to the documentation provided by the software) and were therefore determined to be produced by the statistical package in error. The most prevalent error occurred in several of the results provided by SPSS for the Wilcoxon Mann–Whitney U test across the large-, medium-, and small-N simulations, where Wilcoxon’s W was computed as the sum of ranks for the first sample. The documentation clearly states that the sum of ranks for the second sample is used to calculate Wilcoxon’s W, and there is no apparent pattern as to when this deviation from the documented procedure would occur. Although the results produced matched those obtained through hand calculation, we consider them to be due to computational error, as the method used to obtain these findings is not consistent with that which is described in the package’s documentation. Additionally, SPSS and R failed to compute the test statistic and continuity-corrected p-value for Pearson’s chi-square test on two of the simulated medium-N datasets and one of the simulated small-N datasets due to the presence of one cell with zero observations in the 2 × 2 contingency table; the exact p-values produced by SPSS and SAS for these three simulated datasets were also affected by this (i.e., p = 1.000).

Statistical output

Extensive differences were observed in the statistical output produced for each nonparametric test across the four statistical packages, and the output was consistent within each package across the primary and simulated analyses. Each package produces a variety of statistical output in addition to the specific test statistics that are typically required to be reported and interpreted in the literature (Table 3); see the Supplemental Material for details of the statistical output provided across packages. With regard to the inferential statistics that are necessary for one to make basic conclusions about the hypothesis that was tested (i.e., the test statistics, degrees of freedom, and p-values), several inconsistencies continued to be seen across SPSS, SAS, Stata, and R.

Pearson’s chi-square and Kruskal–Wallis

While test statistics, degrees of freedom, and p-values were reported by all packages for the Pearson’s chi-square and Kruskal–Wallis tests, only SPSS provided these both with and without a continuity correction for Pearson’s chi-square, and only Stata provided these both with and without an adjustment for ties for the Kruskal–Wallis test. Fisher’s exact one- and two-tailed p-values are reported with Pearson’s chi-square results by SPSS and SAS, by default, for 2 × 2 contingency tables. Stata will also produce Fisher’s exact p-values for any data if the exact option is included in the command. Across the primary and simulated datasets, exact two-tailed p-values reported by SPSS and SAS differed from uncorrected and continuity-corrected asymptotic p-values as a function of sample size (Fig. 4). On average, exact two-tailed p-values were greater than uncorrected asymptotic p-values by .021 (SD = .015), .049 (SD = .031), and .055 (SD = .046) for the large-, medium-, and small-N analyses, respectively. In contrast, exact two-tailed p-values were smaller than continuity-corrected asymptotic p-values obtained from the large-, medium-, and small-N datasets by −.013 (SD = .010), −.047 (SD = .032), and −.071 (SD = .049), respectively.

Fig. 4

Average differences in Fisher’s exact (two-tailed) versus uncorrected and continuity-corrected asymptotic p-values for Pearson’s chi-square test across large, medium, and small sample sizes. Note. Large n = 1000; medium n = 200; small n = 100; average values are based on differences (exact − asymptotic) in two-tailed p-values across primary and simulated datasets, with the standard deviation of the differences reflected in the error bars.

Wilcoxon Mann–Whitney U

In the output produced for the Wilcoxon Mann–Whitney U test, SPSS alone generated Wilcoxon’s W, Mann–Whitney’s U, the asymptotic test statistic (z-statistic), and the asymptotic p-value. In contrast, the output from SAS and Stata did not include the Mann–Whitney U statistic, and while Wilcoxon’s W technically appears in the rank-sum frequency table, it is neither clearly indicated nor differentiated from the other rank-sum values reported in Stata’s output. Additionally, only the Mann–Whitney U statistic and the continuity-corrected p-value are provided in the output generated in R; however, the test statistic is incorrectly labeled as “W.” Exact two-tailed p-values for the Wilcoxon Mann–Whitney U test are also reported by Stata for samples of n ≤ 200 and therefore appear in our medium- and small-N results; exact p-values were very similar to asymptotic p-values across all medium- and small-N datasets. No other packages produced exact statistics for the Wilcoxon Mann–Whitney U test performed on our data; however, an exact two-tailed p-value is reported by R in place of the continuity-corrected p-value when n < 50 and no ties are present, and according to the documentation, SPSS reports an exact 2*one-tailed p-value along with the asymptotic p-value when n is “not too large.”

Wilcoxon signed-rank

Finally, none of the four packages reported all of the fundamental inferential statistics in their output for the Wilcoxon signed-rank test. While both SPSS and Stata report the asymptotic z-statistic and p-value, neither reports the Wilcoxon T statistic (although it appears without a label in the rank-sum frequency table). In contrast, while the Wilcoxon T statistic and p-values are reported in the output of both SAS and R, neither of these packages reports the asymptotic z-statistic, and R incorrectly labels the test statistic as “V.” An exact two-tailed p-value was generated by Stata for the medium- and small-N analyses, and this occurs by default for samples of n ≤ 200; exact p-values were very similar to asymptotic p-values across all medium- and small-N datasets. No other exact statistics were generated for the Wilcoxon signed-rank test across packages; however, SAS will report an exact p-value when n ≤ 20 and R will replace the asymptotic p-value with an exact p-value when n < 50 and no ties are present. It is important to note that SAS labels all test statistics as “S,” even though several test statistics are commonly defined by specific labels.

Discussion

In the current era of psychological science, much of the field’s attention has been directed at methodological practices and reproducible science. However, we have seen little evidence that this increased attention has brought much awareness to inconsistencies between statistical software packages. The present study investigated whether the same results would be generated from the same nonparametric statistical procedures when performed on the same dataset by different statistical software packages. Although results were largely consistent across software packages, our findings bring into question the extent to which reporting error and/or misinterpreted results are present in the psychological science literature. Our most notable findings were related to the extent of algorithmic variation that exists across packages for a given nonparametric analysis. The algorithms implemented are often native to each software package, do not require user specification, and are not consistently applied within a single package for samples with varying characteristics. These adjustments include automatic corrections or transformations that, while useful in certain situations, are not always readily apparent to, and may not be desired by, the user conducting the tests.

In our data, the algorithmic variations generally resulted in minor test statistic or p-value discrepancies that may not have substantial impacts on one’s statistical conclusions. The exception to this lies in the continuity-corrected versus uncorrected p-values generated for Pearson’s chi-square analysis (Fig. 3). By default, uncorrected p-values are generated by SAS and Stata, and Yates’ continuity correction is applied to the calculation of the chi-square statistic in R, resulting in a larger p-value. The only p-value produced by SAS and Stata was smaller than the only p-value produced by R by an average value of .083 across our primary and simulated datasets. The differences in continuity-corrected and uncorrected p-values ranged from .005 to .162, largely as a function of sample size. In addition, Fisher’s exact p-values are provided alongside these results by some packages, depending on the characteristics of the data. When exact p-values were provided for our analyses, they were larger than the uncorrected asymptotic p-values but smaller than the continuity-corrected asymptotic p-values, and these differences also varied as a function of sample size (Fig. 4). The discrepancies seen here demonstrate the potential for statistical conclusions to be influenced by the software package used, especially when users are unaware of the underlying algorithm used or naïve as to when/if a continuity correction or exact statistic is appropriate. This has important implications, given that Pearson’s chi-square is among the most commonly used nonparametric tests in the psychological science literature. Furthermore, in cases where multiple p-values are provided for a given test, a naïve user may be inclined to report the value that supports their hypothesis, especially if they are not sure which p-value is the most appropriate for their data.

Perhaps the most intriguing finding, however, is the computational error we encountered from SPSS, where the program would unpredictably calculate the Wilcoxon Mann–Whitney U test statistic based on the rank sum of the first sample, when the documentation clearly states that the second sample is used for calculation. Although this has no effect on the p-value and should not impact conclusions of statistical significance, the inverted z-statistic may result in confusion over the direction of an effect.

Overall, our findings suggest that the results of several common nonparametric tests and measures of normality differ when performed on the same data across SPSS, SAS, Stata, and R. These differences were driven primarily by the inconsistent application of default algorithmic procedures across software packages. Furthermore, our medium- and small-N analyses revealed that, as we approach sample sizes more commonly used in the current literature, discrepancies in statistical output are increased, due to the application of continuity corrections or calculation of exact statistics.

We also found that some statistical output generated from these nonparametric tests included unnecessary information. We believe that programs producing unnecessary and ambiguous output may increase confusion and/or selective reporting of misleading results in the literature, especially when there is no clear indication or description of the values generated. In particular, we argue that confusion is more likely when a student, psychological science researcher, or other non-statistician investigator attempts to interpret the results of a single test when multiple p-values are present in the generated output. This may be especially relevant when algorithms used to generate the different p-values are not clearly indicated in the output or defined in the documentation. At worst, when multiple p-values are provided for a single test, we fear that publication bias and the current “publish or perish” mentality in scientific research may lead to selective reporting of whichever p-value most closely aligns with the hypothesis. Taken together, these factors and discrepancies may contribute to the replication failure in the social and behavioral science literature.

Overall, although we found that there were no discrepancies across statistical packages that would result in an incorrect conclusion from our data, the findings do suggest that within statistics packages, poor documentation and labeling of output may lead to an incorrect conclusion. For example, we found large differences in p-values between continuity-corrected and uncorrected p-values. It is possible that a researcher may select the smaller of the two p-values, either purposefully, due to unclear labeling of the output, or because it is the only p-value a given statistical package produces. As can be seen in Figs. 3 and 4, differences in p-values that occur due to algorithmic variations can be large, especially in smaller datasets.

Limitations and future directions

The current investigation has noteworthy limitations. Although we found differences in the results generated across software packages, we want to encourage caution when generalizing these findings to other programs and other versions of the software. Software developers (especially those working on open-source software, like R) are continually updating their programs and correcting bugs in the software. Many of the errors we found may be due to bugs in the software that have not yet been discovered or due to differences in the output selected by the program; thus, the findings discussed herein are specific to the programs we compared and the versions of these programs we used for the present analyses. We encourage readers to consider the context in which the present discrepancies across results were found and to use the present findings as motivation to more closely consider their own statistical results. It is also important to note that the inconsistent findings considered to be due to computational error may not be the result of miscalculation per se; these results were simply referred to as computational error because they did not reflect the true result when the computational procedures described in the packages’ documentation were followed. It is likely that the procedures were incorrectly described in the documentation, and in such cases, these inconsistencies would be better described as misspecifications than errors.

Our results may not generalize to other, more commonly used statistical procedures, such as parametric analyses. Much of the heterogeneity we found across statistical packages was due to algorithmic variation, particularly variation related to continuity corrections and the calculation of exact p-values. These procedures are not relevant to parametric analyses; thus, there may be fewer ways in which the results of parametric statistics could differ across software packages. It is also important that our findings be replicated using other meaningful data to determine the extent to which the discrepancies seen here affect results of more practical or clinical importance. Further, we strongly encourage future work to leverage systematic and meta-analytic methods to examine the existing literature in terms of statistical rigor, researcher degrees of freedom, and the methodology used when conducting nonparametric analyses, as such work would greatly benefit ongoing replication efforts. A similar consideration of the most frequently used statistical methods in the general psychological science literature is also needed. Although we conducted a cursory review of nonparametric methods in the recent psychological science literature, a meta-analysis or systematic review documenting researcher degrees of freedom in nonparametric analysis would provide a clearer understanding of the current state of the problem in our field.

Bootstrapping is a resampling technique in which an analysis is repeated a large number of times on datasets generated by sampling, with replacement, from the observed data (or, in the parametric case, by simulating from a chosen distribution). The technique is growing in popularity and is increasingly included in the latest statistical software. We did not conduct simulations in which the formulas used by each statistical software platform were extracted and tested within a single program to compare their performance, and the absence of such a simulation may be a limitation. We omitted it because we consider it beyond the scope of the current paper (i.e., the question of whether output differs across software packages can be addressed without conducting a large simulation in R). Further, although bootstrapping may address some limitations of relying on a single analysis in a single statistical software package (e.g., by producing simulated confidence intervals), we caution against treating bootstrapping as a “fix” for discrepancies across statistical software. Bootstrapping analyses may be prone to the same variations in computations, formulas, labeling, and output across statistical software that affected the analyses examined in the current paper. Thus, our recommendation to cross-examine results applies equally to bootstrapping and other simulation techniques; a brief sketch of such a resampling procedure is given below.
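
For readers unfamiliar with the procedure, a minimal sketch of a nonparametric bootstrap in base R follows (hypothetical data, illustrative only); it resamples the observed values with replacement and recomputes a statistic of interest on each resample:

```r
set.seed(42)                 # make the resampling reproducible
x <- c(12, 15, 17, 20, 22)   # hypothetical data, as above
y <- c(14, 19, 25, 27, 30)

B <- 10000                   # number of bootstrap resamples
boot_diffs <- replicate(B, {
  median(sample(x, replace = TRUE)) - median(sample(y, replace = TRUE))
})

# Percentile bootstrap confidence interval for the difference in medians
quantile(boot_diffs, c(0.025, 0.975))
```

Even in this simple form, implementations differ across packages (e.g., in the default number of resamples, the random-number generator, and the type of confidence interval reported), so the same cross-checking advice applies.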

Finally, we would like to note the open-source nature of R, which separates it from the other statistical packages evaluated herein. Any user can create libraries that can be downloaded and used by others in their analyses. User-written libraries do not necessarily undergo the same rigorous testing or vetting as the libraries built into the program; thus, there is some risk in utilizing libraries outside of the 14 that ship with R, as they may be more prone to error. Furthermore, the documentation for user-written libraries ranges from comprehensive to nonexistent. Given these limitations, if a researcher uses only R with third-party libraries to analyze their data, cross-examination between R and other established statistical software platforms may be especially important.
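
For readers who wish to check which libraries ship with R itself and which are user-contributed, base R can report the distinction directly; this is a minimal sketch, and the set of contributed libraries will of course depend on the user's installation:

```r
# Libraries distributed with R itself (priority "base"); there are 14 of these
rownames(installed.packages(priority = "base"))

# "Recommended" libraries bundled with most R distributions
rownames(installed.packages(priority = "recommended"))

# Everything else on the library path is user-contributed
setdiff(rownames(installed.packages()),
        rownames(installed.packages(priority = c("base", "recommended"))))
```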

Implications and recommendations

The present results raise questions about previous replication efforts. It is well known within the social and behavioral sciences that efforts to replicate many landmark studies have failed (Open Science Collaboration, 2015). It is possible that algorithmic variation or other differences in the computation or reporting of results between statistical software packages are contributing to this problem, and future replication efforts should consider this possibility and focus on determining whether failures to replicate may be due, at least in part, to differences among the software packages used. The heterogeneity we have demonstrated speaks to the importance of open-access science; without knowing the underlying procedures used to generate statistical results, it may be difficult to directly replicate previous studies. Replication efforts within the social and behavioral sciences have made significant strides toward promoting open-access science, but work remains to be done.

The results of the current study suggest that researchers may benefit from cross-examining their results across statistical software platforms. We recommend making a consistent practice of comparing the results from one statistical software platform with those from at least one other platform. This practice may reduce the chances of drawing inaccurate conclusions due to inconsistencies or errors in the output, labeling, or underlying algorithms of a given software package. At the time of this publication, Stata appeared to be the most consistent with our hand calculations. However, the differences described in the current paper are likely to be resolved or altered through software updates, so this may not always be the case. Furthermore, consistency with hand calculation does not necessarily mean that one platform produced more accurate results than another, but rather that the algorithms used to compute the results were more consistent with our expectations, based on the original formulas provided in textbooks and other literature. Finally, if inconsistencies arise during cross-examination, we recommend checking the documentation and conducting hand calculations to determine the root of the discrepancies (an example of such a hand calculation is sketched below). Thus, we argue that cross-examining one’s results across software platforms and reviewing the formulas provided in the software’s documentation are more important than the choice of software itself. Once researchers have chosen which software to use, they should justify that choice in their manuscript.
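
As an example of the kind of hand calculation we have in mind, the sketch below (hypothetical data; textbook normal-approximation formula for the Mann-Whitney U statistic, assuming no ties) recomputes an asymptotic p-value directly and compares it with R's output under matching settings:

```r
x <- c(12, 15, 17, 20, 22)   # hypothetical data, as above
y <- c(14, 19, 25, 27, 30)
n1 <- length(x); n2 <- length(y)

# Mann-Whitney U from the rank sum of x (textbook formula, no ties)
r <- rank(c(x, y))
U <- sum(r[seq_len(n1)]) - n1 * (n1 + 1) / 2

# Normal approximation without a continuity correction
z <- (U - n1 * n2 / 2) / sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
p_hand <- 2 * pnorm(-abs(z))

# The same quantity as computed by R under matching settings
p_r <- wilcox.test(x, y, exact = FALSE, correct = FALSE)$p.value
c(hand = p_hand, software = p_r)
```

If the two values disagree, the documentation (e.g., the treatment of ties or the presence of a correction) is the first place to look.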

We advocate for fully open-access science, including the public hosting of datasets, the command syntax used to generate results, and all study materials used in the final analysis. Initiatives such as the Open Science Framework (OSF) have made considerable progress in promoting open-access science by emphasizing preregistration, in which researchers report all study procedures prior to beginning data collection and publish the results upon study completion. Additionally, OSF offers data-hosting options that allow researchers to post their datasets and analysis pipelines for future replication efforts. This movement toward transparency is promising, as more publishers require documentation of study procedures and as academia begins to encourage preregistration and the sharing of all study data and materials. We suggest that future studies continue this trend of transparency.

We strongly encourage all researchers, regardless of career stage, to carefully examine the documentation of their software packages and to ensure that the package selected performs the test as they expect it to. Furthermore, users should be aware of the computational idiosyncrasies of the software used to conduct their statistical analyses, and we recommend that results be compared across platforms to avoid reporting errors or misleading results arising from algorithmic variation, computational error, or other output-related characteristics of a given program. Of utmost importance is the need for adequate reporting of the statistical procedures used when analyzing data and for full transparency when publishing the results of such analyses.

We also suggest that future work consider the use of third-party services to address issues, such as a lack of version control, the absence of containerization, and ambiguous documentation of code and dependencies, that may cause future replication attempts to fail. GitHub is a popular third-party service that hosts version-controlled code. Similarly, Docker is a service that constructs an isolated, reproducible computing environment (referred to as a “container”), which allows individuals to share not only their code but also all of the dependencies and software needed to run it. The use of containers greatly reduces ambiguity concerning file execution order and the dependencies required to replicate an analysis. Further, we strongly encourage researchers to clearly annotate any code, scripts, or other documentation shared publicly. Commenting at each step of a script not only increases clarity and transparency but also helps future replication efforts follow the steps of the analysis that were taken. An excellent example of the level of transparency required is provided by Peikert and colleagues’ (2021) tutorial on accomplishing these tasks in R.
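
In the same spirit, and whether or not a full container is used, R users can record the software environment behind a shared analysis with a single command; we offer this only as a minimal illustration, not as a substitute for the full workflow that Peikert and colleagues describe:

```r
# Write a record of the R version, operating system, and attached
# library versions alongside the shared analysis scripts.
writeLines(capture.output(sessionInfo()), "session_info.txt")

# Dependency-management libraries (e.g., renv) can additionally lock
# library versions for later restoration, e.g., renv::snapshot()
```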

Some of the burden of open-access science lies on the shoulders of software developers. Several issues were present in the help files and documentation provided by the software packages we evaluated: the documentation for the analyses we conducted often lacked sufficient information about a given test (e.g., R) or about the algorithms used to produce the results (e.g., Stata), and in some cases it was not easily accessible (e.g., SAS) or was difficult to interpret (e.g., SPSS). Ambiguity in documentation could play a role in the inconsistent replication of scientific experiments noted in recent years, and it is essential that software developers be made aware of the importance of clear, exhaustive, and easily accessible documentation for these packages.

Conclusion

The results of our study demonstrate the unreliability of results produced for nonparametric tests and measures of normality across SPSS, SAS JMP Pro, Stata, and R. The inconsistent results were primarily due to differences in the default algorithms used; however, computational error and differences in statistical output also contributed to this unreliability. These discrepancies, along with unnecessary and/or unclearly defined statistical output generated for a given test, may contribute to confusion, selective reporting or reporting error, and, ultimately, replication failure. We urge researchers to consult the documentation (when it is available) when interpreting results, to compare their statistical results across software platforms, and to describe all statistical procedures, including the statistical software package and version used, when publishing results, in order to aid future replication efforts.