Violating the normality assumption may be the lesser of two evils

When data are not normally distributed, researchers are often uncertain whether it is legitimate to use tests that assume Gaussian errors, or whether one has to either model a more specific error structure or use randomization techniques. Here we use Monte Carlo simulations to explore the pros and cons of fitting Gaussian models to non-normal data in terms of risk of type I error, power and utility for parameter estimation. We find that Gaussian models are robust to non-normality over a wide range of conditions, meaning that P values remain fairly reliable except for data with influential outliers judged at strict alpha levels. Gaussian models also performed well in terms of power across all simulated scenarios. Parameter estimates were mostly unbiased and precise except if sample sizes were small or the distribution of the predictor was highly skewed. Transformation of data before analysis is therefore often advisable, and visual inspection of the data for outliers and heteroscedasticity remains important. In strong contrast, some non-Gaussian models and randomization techniques bear a range of risks that are often insufficiently known. High rates of false-positive conclusions can arise, for instance, when overdispersion in count data is not controlled appropriately or when randomization procedures ignore existing non-independencies in the data. Hence, newly developed statistical methods not only bring new opportunities, but they can also pose new threats to reliability. We argue that violating the normality assumption bears risks that are limited and manageable, while several more sophisticated approaches are relatively error prone and particularly difficult to check during peer review. Scientists and reviewers who are not fully aware of the risks might benefit from preferentially trusting Gaussian mixed models in which random effects account for non-independencies in the data.

Supplementary Information: The online version contains supplementary material available at 10.3758/s13428-021-01587-5.


Description of the TrustGauss functions

The main function of the "TrustGauss" package is TrustGauss() and it can take 29 arguments that are described in the documentation to the function, which can be accessed via

> ?TrustGauss

TrustGauss() can be used to assess type I error rates of linear regression models that are fitted through a call to the base R function glm(). We here briefly summarize each of the 29 arguments. Default settings for each argument are given in the documentation to the function.

1. Family. This argument takes a character input and specifies the error distribution and link function to be used in the generalized linear model. It can take one of the following values: "gaussian", "poisson", "binomial", "quasipoisson", "quasibinomial" or "Gamma". Since the argument is passed directly to the glm() function, the link function can be specified in the standard way, for example as "gaussian(link = 'identity')". See also the glm() function for further details.
2. nSamples. This argument takes a numeric integer input, specifying the number of samples/data points to simulate.
3. nSimulations. This argument takes a numeric integer input, specifying for how many iterations the simulation will run.
4. SaveAllOutput. This argument is Boolean. If it is set to TRUE, all individual data points of the dependent and independent variables are returned in a list. They can be found in "Data" with column names "Dependent", "Cov1", "Cov2", ..., "Fac1", "Fac2", ... (depending on what combination of covariates and factors is specified, see below).
5. CompareTtest. This argument is Boolean. If it is set to TRUE, P-values are calculated through both the glm() and the t.test() function. This is only valid when a single factor with two levels is fitted as the independent variable with distribution set to "UniformCategorical" (see below).
6. PlotExample. This argument is Boolean. If it is set to TRUE, one example histogram of the distribution of the dependent variable Y is plotted.
7. DistributionY. This argument takes a character input and specifies the distribution of the dependent variable Y. It can take one of the following values: "Gaussian", "GaussianCategorical", "GaussianZero", "AbsoluteGaussian", "Gamma", "GammaCategorical", "GaussianZeroCategorical", "Binomial", "NegativeBinomial", "StudentsT", "Poisson" or "Uniform". In principle, the corresponding base R functions for generating random numbers are used, and their parameters can be specified (see below). "GaussianCategorical" generates normally distributed integers. "GaussianZero" generates a zero-inflated normal distribution. "AbsoluteGaussian" simulates absolute values of a Gaussian distribution. "GaussianZeroCategorical" first generates a zero-inflated normal distribution and then produces categories. "GammaCategorical" generates gamma distributed integers. (A sketch of these transformations is given after this argument list.)
8. DistributionXCov. This argument takes a character input and specifies the distribution of the independent covariate X. It can take the same values as DistributionY. It is also possible to specify multiple different distributions in order to fit more than one covariate (as a vector). Additionally, it can be set to NULL if only factors should be fitted.
9. DistributionXFac. This argument takes a character input and specifies the distribution of the independent factor X. Only the categorical distributions are valid inputs ("GaussianCategorical", "GammaCategorical", "GaussianZeroCategorical", "Binomial" or "UniformCategorical"). It is also possible to specify multiple different distributions in order to fit more than one factor (as a vector). Additionally, it can be set to NULL if only covariates should be fitted.

The following arguments specify parameters for the distributions of the independent and dependent variables.

10. MeanY.gauss. This argument takes a numeric input, specifying the mean of the distribution of Y, if DistributionY is set to "Gaussian", "GaussianCategorical", "GaussianZero", "GaussianZeroCategorical" or "AbsoluteGaussian". See also the rnorm() function for further details. The operations of categorization, taking the absolute value or adding zero-inflation are performed after the call to the rnorm() function.
11. SDY.gauss. This argument takes a numeric input, specifying the standard deviation of the distribution of Y, if DistributionY is set to "Gaussian", "GaussianCategorical", "GaussianZero", "GaussianZeroCategorical" or "AbsoluteGaussian". See also the rnorm() function for further details. The operations of categorization, taking the absolute value or adding zero-inflation are performed after the call to the rnorm() function.
12. nCategoriesY.cat. This argument takes a numeric integer input, specifying how many categories are simulated, if DistributionY is set to "GaussianCategorical" or "GammaCategorical".
13. zeroLevelY.zero. This argument takes a numeric input, specifying the proportion of data that will be set to 0, if DistributionY is set to "GaussianZero" or "GaussianZeroCategorical".
14. ShapeY.gamma. This argument takes a numeric input, specifying the shape parameter k, if DistributionY is set to "Gamma", "GammaCategorical" or "NegativeBinomial". See also the rgamma() and rnbinom() functions for further details. Categorization is performed after the call to the rgamma() function.
15. ScaleY.gamma. This argument takes a numeric input, specifying the scale parameter θ, if DistributionY is set to "Gamma", "GammaCategorical" or "NegativeBinomial". See also the rgamma() and rnbinom() functions for further details. Categorization is performed after the call to the rgamma() function.
16. DFY.student. This argument takes a numeric input, specifying the degrees of freedom, if DistributionY is set to "StudentsT". See also the rt() function for further details.
17. MinY.uni. This argument takes a numeric input, specifying the minimum of the distribution, if DistributionY is set to "Uniform". See also the runif() function for further details.
18. MaxY.uni. This argument takes a numeric input, specifying the maximum of the distribution, if DistributionY is set to "Uniform". See also the runif() function for further details.
19. LambdaY.pois. This argument takes a numeric input, specifying the mean of the distribution, if DistributionY is set to "Poisson". See also the rpois() function for further details.
20. MeanX.gauss. This argument takes a numeric input, specifying the mean of the distribution of the independent variable X, if DistributionX is set to "Gaussian", "GaussianCategorical", "GaussianZero", "GaussianZeroCategorical" or "AbsoluteGaussian". See also the rnorm() function for further details. The operations of categorization, taking the absolute value or adding zero-inflation are performed after the call to the rnorm() function.
21. SDX.gauss. This argument takes a numeric input, specifying the standard deviation of the distribution of the independent variable X, if DistributionX is set to "Gaussian", "GaussianCategorical", "GaussianZero", "GaussianZeroCategorical" or "AbsoluteGaussian". See also the rnorm() function for further details. The operations of categorization, taking the absolute value or adding zero-inflation are performed after the call to the rnorm() function.
22. nCategoriesX.cat. This argument takes a numeric integer input, specifying how many categories are simulated, if DistributionX is set to "GaussianCategorical" or "GammaCategorical".
23. zeroLevelX.zero. This argument takes a numeric input, specifying the proportion of data that will be set to 0, if DistributionX is set to "GaussianZero" or "GaussianZeroCategorical".
24. ShapeX.gamma. This argument takes a numeric input, specifying the shape parameter k, if DistributionX is set to "Gamma", "GammaCategorical" or "NegativeBinomial". See also the rgamma() and rnbinom() functions for further details. Categorization is performed after the call to the rgamma() function.
25. ScaleX.gamma. This argument takes a numeric input, specifying the scale parameter θ, if DistributionX is set to "Gamma", "GammaCategorical" or "NegativeBinomial". See also the rgamma() and rnbinom() functions for further details. Categorization is performed after the call to the rgamma() function.
26. DFX.student. This argument takes a numeric input, specifying the degrees of freedom, if DistributionX is set to "StudentsT". See also the rt() function for further details.
27. MinX.uni. This argument takes a numeric input, specifying the minimum of the distribution, if DistributionX is set to "Uniform". See also the runif() function for further details.
28. MaxX.uni. This argument takes a numeric input, specifying the maximum of the distribution, if DistributionX is set to "Uniform". See also the runif() function for further details.
29. LambdaX.pois. This argument takes a numeric input, specifying the mean of the distribution, if DistributionX is set to "Poisson". See also the rpois() function for further details.
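To make the derived distributions concrete, the following minimal sketch shows how such transformations can be implemented in base R. This is our own illustration of the behaviour described under argument 7, not the package's internal code, and the variable names are hypothetical.

    ## Illustrative sketch of the derived distributions (not the package's code)
    n <- 100

    ## "AbsoluteGaussian": absolute values of normal draws
    yAbs <- abs(rnorm(n, mean = 0, sd = 1))

    ## "GaussianZero": zero-inflated normal; a proportion zeroLevel is set to 0
    zeroLevel <- 0.3
    yZero <- rnorm(n, mean = 0, sd = 1)
    yZero[sample(n, size = round(zeroLevel * n))] <- 0

    ## "GaussianCategorical": normal draws binned into nCategories integer levels
    nCategories <- 5
    yCat <- as.integer(cut(rnorm(n, mean = 0, sd = 1), breaks = nCategories))

Note that these operations change the mean and standard deviation (or shape and scale) that were specified for the underlying rnorm() or rgamma() call, as stated in the example section below.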

The function TrustGaussTypeII() can be used for the analysis of type II error rates as described in the main text. It adds a predefined effect to a single covariate only. All arguments can be accessed via

> ?TrustGaussTypeII

The function takes all the above 29 arguments of the TrustGauss() function and two additional ones.

30. EffectXCov. This argument takes a numeric input, specifying the effect size that should be simulated.
31. ZTransform. This argument is Boolean. If it is set to TRUE, the distributions of the dependent variable Y and the independent variable X are Z-transformed prior to adding the effect specified via EffectXCov.
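The text does not show how the effect is injected, so the following sketch assumes it is added linearly to Y after the optional Z-transformation; the actual mechanism should be checked against the package source.

    ## Sketch: adding an effect of size EffectXCov to a single covariate
    ## (assumed mechanism, for illustration only)
    EffectXCov <- 0.3
    ZTransform <- TRUE

    x <- rgamma(100, shape = 1.5, scale = 10)
    y <- abs(rnorm(100, mean = 0, sd = 1))

    if (ZTransform) {                 # Z-transform both variables first
      x <- as.numeric(scale(x))
      y <- as.numeric(scale(y))
    }
    y <- y + EffectXCov * x           # add the simulated effect

    summary(glm(y ~ x, family = gaussian))  # slope should be near EffectXCov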

The function TrustGaussLMM() can be used to fit linear mixed-effects models and to assess type I error rates. It takes the above 29 arguments of the TrustGauss() function. Furthermore, a single random effect can be specified via the following arguments.

32. RanEF. This argument is Boolean. If it is set to TRUE, a single random effect is fitted.
33. nRanEFLevels. This argument takes a numeric integer input, specifying how many repeated measures are generated for each sampling point.
34. RanEFVarExp. This argument takes a numeric input, specifying the amount of variance explained by the random effect. This amount is only correct if the variables are normally distributed.
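As an illustration of how a random effect explaining a proportion RanEFVarExp of the total variance can be simulated and fitted (assuming normally distributed variables, as noted above, and assuming the mixed model is fitted with lme4's lmer(); this is a sketch, not the package code):

    ## Sketch: one random effect explaining RanEFVarExp of the total variance
    library(lme4)                       # assumed here for fitting the mixed model

    nSamples     <- 100                 # number of sampling points (group levels)
    nRanEFLevels <- 5                   # repeated measures per sampling point
    RanEFVarExp  <- 0.3                 # variance explained by the random effect

    group <- factor(rep(seq_len(nSamples), each = nRanEFLevels))
    b     <- rnorm(nSamples, sd = sqrt(RanEFVarExp))        # group deviations
    y     <- b[as.integer(group)] +
             rnorm(length(group), sd = sqrt(1 - RanEFVarExp))
    x     <- rnorm(length(group))       # uncorrelated predictor

    mod <- lmer(y ~ x + (1 | group))    # single random intercept
    summary(mod)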

The three functions in TrustGauss return a list with at least five elements.

1. A data frame Pvals with all P-values of covariates and factors. It has as many rows as simulation runs (nSimulations) and as many columns as fitted covariates and factors. Each row represents the P-values of a single iteration. Column names are of the form "PCov1", "PCov2", …, "PFac1", "PFac2", … (depending on what combination of covariates and factors is specified). If CompareTtest was set to TRUE, a two-column data frame is returned with columns "PFac1" and "PTtest". The second column represents the P-values of a t-test.
2. A data frame ResCCC. It has as many rows as simulation runs (nSimulations) and six columns. Each row represents the following estimates of a single iteration: Column "PSWY" contains all P-values of a Shapiro-Wilk test for normality of the dependent variable Y, column "PSWRes" contains all P-values of a Shapiro-Wilk test for normality of the residuals, column "rho" contains the concordance correlation coefficient (Lin 1989) between observed and expected residuals assuming normality for each model. We use the qqnorm() function to generate the expected values. Columns "s.shift", "l.shift" and "C.b" contain the scale shift, the location shift and the bias correction factor of the concordance correlation (Lin 1989). A sketch of these computations is given after this list.

[…]

Column "Mean.PSWRes" contains the mean P-value of the Shapiro-Wilk tests for normality of the residuals, calculated as

> 10^mean(log10(ResCCC$PSWRes))

Columns "QL.PSWR" and "QU.PSWR" provide the lower and upper 95% quantiles of the P-values of the Shapiro-Wilk tests for normality of the residuals. Column "Mean.PSWY" contains the mean P-value of the Shapiro-Wilk tests for normality of the dependent variable Y, calculated as

> 10^mean(log10(ResCCC$PSWY))

Columns "QL.PSWY" and "QU.PSWY" provide the lower and upper 95% quantiles of the P-values of the Shapiro-Wilk tests for normality of the dependent variable.

6. If the argument SaveAllOutput is set to TRUE, a list Data that contains as many data frames as there are dependent and independent variables fitted. The data frames are named "Dependent", "Cov1", "Cov2", … "Fac1", "Fac2", … (depending on what combination of covariates and factors is specified). In each of these data frames the values of Y (Dependent) and X (all the other data frames) are stored. Each of these data frames has as many rows as simulation runs (nSimulations) and as many columns as the specified sample size (nSamples). Each row represents the data values of a single iteration.
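The following sketch illustrates, for a single fitted model, how expected residuals can be obtained via qqnorm() and how the concordance correlation (Lin 1989) and the geometric-mean P-values can be computed. It is our own illustration rather than the package code; the ccc() helper is hypothetical and uses sample moments as an approximation to Lin's estimator.

    ## Illustration: residual diagnostics for one fitted model (not package code)
    y   <- abs(rnorm(100))
    x   <- rgamma(100, shape = 1.5, scale = 10)
    res <- resid(glm(y ~ x, family = gaussian))

    qq <- qqnorm(res, plot.it = FALSE)  # qq$x = expected, qq$y = observed values

    ccc <- function(a, b) {             # concordance correlation, Lin (1989)
      2 * cov(a, b) / (var(a) + var(b) + (mean(a) - mean(b))^2)
    }
    rho <- ccc(qq$x, qq$y)

    pSWRes <- shapiro.test(res)$p.value # Shapiro-Wilk P-value for the residuals

    ## Geometric mean over many runs, as used for "Mean.PSWRes" and "Mean.PSWY"
    p <- runif(50)                      # placeholder for ResCCC$PSWRes
    10^mean(log10(p))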

The function TrustGaussTypeII() returns one more element.

7. A data frame Effects with the parameter estimates of the covariate with an added effect. It has as many rows as simulation runs (nSimulations) and three columns. Each row represents the parameter estimates of a single iteration. Column "Intercept" contains the estimate of the intercept, column "Estimate" contains the slope estimate and column "SE" the standard error of the slope estimate.
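A brief usage sketch, assuming a TrustGaussTypeII() result stored in sim with the elements named as described above:

    ## Summarizing the Effects data frame of a TrustGaussTypeII() run
    mean(sim$Effects$Estimate)          # average slope estimate (bias check)
    mean(sim$Effects$SE)                # average standard error of the slope
    mean(sim$Pvals$PCov1 <= 0.05)       # power at an alpha-level of 0.05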

The […] Thus, it has as many elements as simulation runs (nSimulations).

"GaussianCategorical", "GaussianZero", "GaussianZeroCategorical", "AbsoluteGaussian" and "GammaCategorical" make use of additional functions to generate categories, to take absolute values or to introduce zero-inflation after a call to rnorm() or rgamma(), thereby changing the specified mean and standard deviation or shape and scale.

Assume we want to run a simulation with 100 observations in each of 50,000 iterations. We specify the Family argument as "gaussian" to fit a linear model with identity link. See the glm() documentation for details on the Family argument. This is equivalent to fitting a linear model assuming normally distributed errors. Since we might also be interested in the individual observations of each simulation run, we specify SaveAllOutput=TRUE. This will save the 100 × 50,000 = 5,000,000 data points of the dependent variable Y and the 5,000,000 observations of the predictor X. We want the dependent variable Y to be distributed as the absolute values of a Gaussian distribution with mean 0 and standard deviation 1. The independent variable X should be a single covariate that is Gamma distributed with shape 1.5 and scale 10. Thus, our call to TrustGauss() looks like this

> sim <- TrustGauss(Family="gaussian", nSamples=100,
    nSimulations=50000, SaveAllOutput=TRUE,
    DistributionY="AbsoluteGaussian", DistributionXCov="Gamma",
    DistributionXFac=NULL, MeanY.gauss=0, SDY.gauss=1,
    ShapeX.gamma=1.5, ScaleX.gamma=10)

A progress bar indicates the progress of the simulation and is updated after every iteration. As soon as the simulation has finished, we can access every element of the resulting list. For example, if we want to obtain the data frame of individual P-values for every iteration in this call to TrustGauss(), we can access it via

> sim$Pvals

A summary of the type I error rate at an α-level of 0.05 is accessible through

> sim$Alphas.05

In this call to TrustGauss() with a single covariate, this yields only a single value for the independent variable X. If multiple covariates or factors had been specified, a separate type I error rate for each of them would be displayed.
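As a quick sanity check, the summary in sim$Alphas.05 can be recomputed from the raw P-values (assuming the element and column names given above):

> mean(sim$Pvals$PCov1 <= 0.05)
> binom.test(sum(sim$Pvals$PCov1 <= 0.05), nrow(sim$Pvals))$conf.int

The second line gives a simple 95% confidence interval for the type I error rate across the 50,000 iterations.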

Description of the basic simulations

After specifying the input arguments for the TrustGauss() function (see above), the simulation starts by generating uncorrelated data for the dependent variable Y and the independent variable X according to the input parameters. Following our above example, specifying a sample size of nSamples = 100 and the distribution of Y as DistributionY = "AbsoluteGaussian" with MeanY.gauss = 0 and SDY.gauss = 1, the function generates data for Y as

> y <- abs(rnorm(n=100, mean=0, sd=1))

Similarly, for the independent variable X we specify DistributionXCov = "Gamma" with ShapeX.gamma = 1.5 and ScaleX.gamma = 10. Then, the function generates data for X as

> x <- rgamma(n=100, shape=1.5, scale=10)

After the data-generating step, two linear models are fitted using the Family argument (argument 1) from above

> mod1 <- glm(y ~ x, family=Family)
> mod2 <- glm(y ~ 1, family=Family)

The models are compared via

> anova(mod1, mod2)

We keep the P-value of this model comparison and start the next iteration by generating data for Y and X again. The number of iterations was set to nSimulations = 50,000 in all our simulations. We further set Family = "gaussian", with the exception of models with a Poisson error structure that are highlighted as such in the main text.
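Putting these steps together, one iteration of the basic simulation can be reproduced in a few lines of base R. This is a sketch of the procedure described above; in particular, we assume an F test in anova() to extract a P-value, which should be confirmed against the package source.

    ## Sketch of the basic simulation loop (reduced number of iterations)
    nSimulations <- 1000                # 50,000 in the actual simulations
    nSamples     <- 100
    p <- numeric(nSimulations)

    for (i in seq_len(nSimulations)) {
      y <- abs(rnorm(nSamples, mean = 0, sd = 1))      # "AbsoluteGaussian"
      x <- rgamma(nSamples, shape = 1.5, scale = 10)   # "Gamma"
      mod1 <- glm(y ~ x, family = gaussian)
      mod2 <- glm(y ~ 1, family = gaussian)
      ## anova() on two glms needs a test= argument to return a P-value
      p[i] <- anova(mod2, mod1, test = "F")$`Pr(>F)`[2]
    }
    mean(p <= 0.05)                     # empirical type I error rate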

Because the data for Y and X are uncorrelated, we expect 5% of all models to yield a P-value ≤ 0.05. If more than 5% of all models have a P-value ≤ 0.05, then the type I error rate is inflated (i.e. too many models yield "significant" results), whereas if fewer models have a P-value ≤ 0.05, then they are conservative, yielding too few "significant" results. In Figure 1, combinations of Y and X that produce inflated type I error rates are coloured red and those yielding conservative P-values are coloured blue.

Introduction of heteroscedasticity
First, we sampled the independent variable X from a binomial distribution, where we varied the success rate from 0.2 to 0.8 in steps of 0.1. Whenever the independent variable X was equal to 0, we sampled the dependent variable Y either from distribution D0 or D7 (see Table 1). We also introduced a third distribution D7.1, which was negative binomial with mean 0.5 and a variance of 1 (with the rationale of introducing the same absolute difference in variance as in D0). Whenever the independent variable X was equal to 1, we sampled the dependent variable from the same distribution but with a 10-times larger variance. We assessed the effects of heteroscedasticity with sample sizes of N = 100 and N = 1000. We then fitted a glm with either a Gaussian or a Quasipoisson error structure, where we tested the significance of the independent variable X via a likelihood ratio test. We fitted these models to 50,000 datasets for each combination of the dependent and predictor variable and sample size (i.e. 3 distributions of the dependent variable × 7 distributions of the independent variable × 2 sample sizes = 42 combinations, each with a Gaussian or a Quasipoisson error structure).
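To illustrate the data-generating step, the sketch below uses a normal distribution as a stand-in for D0/D7 (the exact distributions are defined in Table 1 of the main text and not reproduced here):

    ## Sketch: one heteroscedastic dataset (variance 10 times larger when x == 1)
    n <- 100
    x <- rbinom(n, size = 1, prob = 0.2)    # success rate varied from 0.2 to 0.8

    y <- ifelse(x == 0,
                rnorm(n, mean = 0, sd = 1),          # baseline variance of 1
                rnorm(n, mean = 0, sd = sqrt(10)))   # 10-times larger variance

    mod1 <- glm(y ~ x, family = gaussian)
    mod0 <- glm(y ~ 1, family = gaussian)
    anova(mod0, mod1, test = "Chisq")       # likelihood ratio test for x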

Second, we introduced a second independent variable that we sampled from a uniform distribution with five levels. The other predictor and the dependent variable Y were sampled as described above with sample sizes of N = 100 and N = 1000. We then fitted a glm with either a Gaussian or a Quasipoisson error structure, where we tested the significance of the interaction between the two independent variables via a likelihood ratio test. We fitted these models to 50,000 datasets for each combination of the dependent and predictor variables and both sample sizes as well.
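The corresponding interaction test can be sketched as follows (again with a normal stand-in distribution for the dependent variable):

    ## Sketch: likelihood ratio test for the interaction of the two predictors
    n  <- 100
    x1 <- rbinom(n, size = 1, prob = 0.5)             # two-level predictor
    x2 <- factor(sample(1:5, n, replace = TRUE))      # five-level uniform factor
    y  <- ifelse(x1 == 0, rnorm(n, sd = 1), rnorm(n, sd = sqrt(10)))

    full    <- glm(y ~ x1 * x2, family = gaussian)
    reduced <- glm(y ~ x1 + x2, family = gaussian)
    anova(reduced, full, test = "Chisq")              # LRT for the interaction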

For each simulation run, we recorded the variance in the two groups (as defined by the predictor variable with two levels, see above). We summarized the 50,000 simulation runs by calculating (1) the type I error rate as the number of simulations with a P-value ≤ 0.05 divided by the total number of simulation runs (i.e. 50,000) and (2) the mean observed difference in variances between the two groups (see Table S5).

Tables

Table S1 | Observed and expected power of a regression model in which both the dependent variable Y and the predictor X are normally distributed with mean 0 and standard deviation 1. The expected power was calculated using the power.SLR() function from the powerMediation R package (v0.