# Methods to test for equality of two normal distributions

## Abstract

Statistical tests for two independent samples under the assumption of normality are applied routinely by most practitioners of statistics. Likewise, presumably each introductory course in statistics treats some statistical procedures for two independent normal samples. Often, the classical two-sample model with equal variances is introduced, emphasizing that a test for equality of the expected values is a test for equality of both distributions as well, which is the actual goal. In a second step, usually the assumption of equal variances is discarded. The two-sample *t* test with Welch correction and the *F* test for equality of variances are introduced. The first test is treated solely as a test for equality of central location, the second solely as a test for equality of scatter. Typically, there is no discussion of whether and to what extent testing for equality of the underlying normal distributions is possible, which is quite unsatisfactory regarding the motivation and treatment of the situation with equal variances. It is the aim of this article to investigate the problem of testing for equality of two normal distributions, and to do so using knowledge and methods accessible to statistical practitioners as well as to students in an introductory statistics course. The power of the different tests discussed in the article is examined empirically. Finally, we apply the tests to several real data sets to illustrate their performance. In particular, we consider several data sets arising from intelligence tests since there is a large body of research supporting the existence of sex differences in mean scores or in variability in specific cognitive abilities.

### Keywords

Fisher combination method · Minimum combination method · Likelihood ratio test · Two-sample model

## 1 Introduction

Statistical tests for two independent samples under the assumption of normality are applied routinely by most practitioners of statistics. Likewise, statistical inference for two independent normal samples is of great relevance in every introductory statistics course. There, the approach is often quite similar: First, the importance of shift models is stated, motivating the classical two-sample model with equal variances (see, e.g., Bickel and Doksum 2006, page 4). The ultimate aim is to compare both distributions. If normality is assumed, this corresponds to a test for equality of the expected values, i.e. Student's *t* test. In a second step, usually the assumption of equal variances is discarded. The two-sample *t* test with Welch correction is introduced, though usually without going into the details of Welch's distribution approximation. The introduction and adjacent discussion of the *F* test for equality of variances often varies in the level of detail. Welch's *t* test is treated solely as a test for equality of central location, the *F* test solely as a test for equality of scatter. Typically, there is no discussion of whether and to what extent testing for equality of the underlying normal distributions is possible. Not only is this astonishing in view of the motivation of the classical *t* test, but also for (at least) two other reasons: For one thing, lectures continue with general procedures for testing nested parametric models, including in particular likelihood-ratio tests. For another, when it comes to dealing with the one-way ANOVA, the problem of multiple testing is rarely left unmentioned, along with suitable corrections, most often the Bonferroni correction.

In some textbooks, testing for equality of variances is merely left as an exercise, if not outright skipped. A possible reason for this could be seen in the non-robustness of this particular test against deviations from normality. Still, since no alternative tests are even alluded to, students get the impression that differences in scatter are more or less irrelevant: variance is a statistical Cinderella. Yet, everybody actually applying statistical procedures knows very well that differences in variance and location are of comparable importance.

Summing up, it can be said that from a practical point of view, given a two-sample model under normality, the aim has to be to judge whether the two samples originate from basically similar distributions or not. However, in many cases the classical and, of course, very convenient assumption of equal variances has no grounding. In the midst of these considerations, discussion in lectures and textbooks stops without further ado, and the students (and maybe some lecturers as well) are left without a clue as to how to deal with this situation.

Let \(X_1,\ldots ,X_m\) and \(Y_1,\ldots ,Y_n\) be independent samples from normal distributions \(N(\mu ,\sigma ^2)\) and \(N(\nu ,\tau ^2)\), respectively. In contrast to the classical two-sample *t* test, we do not make further assumptions about the parameters, so that \(\left( \mu ,\nu ,\sigma ^2,\tau ^2\right) \in \Theta =\mathbb {R}^2\times \left( 0,\infty \right) ^2\) is arbitrary. The objective is to test whether the two samples stem from identical distributions. The corresponding testing problem is given by the following hypothesis and alternative:

\[ H_0:\ \mu =\nu ,\ \sigma ^2=\tau ^2 \quad \text{against}\quad H_1:\ \mu \ne \nu \ \text{or}\ \sigma ^2\ne \tau ^2. \tag{1} \]

The first classical approach is to develop a likelihood-ratio test. Doing so is a simple way to obtain an asymptotically valid test. In Sect. 2 the likelihood-ratio test statistic is derived, and different approximations of the distribution of the test statistic under \(H_0\) found in the literature are summed up. Among them are an asymptotic expansion proposed by Muirhead (1982), as well as a recently developed method to derive the exact distribution by numerical integration (Zhang et al. 2012).
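The likelihood ratio statistic for this problem depends on the data only through the maximum likelihood variance estimates under the full model and under \(H_0\). The following sketch computes \(-2\log \Lambda _{m,n}\) in its standard form for two normal samples; the function name `lrt_stat` and the plain-Python style are ours, not the paper's.

```python
import math

def lrt_stat(x, y):
    # -2 log Lambda for H0: both samples share one normal distribution.
    # Under H0 it is asymptotically chi^2-distributed with 2 degrees of freedom.
    m, n = len(x), len(y)
    mean_x, mean_y = sum(x) / m, sum(y) / n
    grand = (sum(x) + sum(y)) / (m + n)          # MLE of the common mean under H0
    v_x = sum((v - mean_x) ** 2 for v in x) / m  # MLE variances (denominators m, n)
    v_y = sum((v - mean_y) ** 2 for v in y) / n
    v_0 = (sum((v - grand) ** 2 for v in x)
           + sum((v - grand) ** 2 for v in y)) / (m + n)  # pooled MLE variance under H0
    return (m + n) * math.log(v_0) - m * math.log(v_x) - n * math.log(v_y)
```

Identical samples give a statistic of zero, and the statistic grows as the empirical means or variances drift apart.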

A further approach is to combine different *p* values as illustrated in Sect. 3. For this procedure the hypothesis \(H_0\) is obtained by combining the hypotheses of both *t* and *F* test. Performing both tests using the same data \((x_1,\ldots ,x_m)\) and \((y_1,\ldots ,y_n)\), the resulting *p* values can be combined yielding a new test statistic and, thus, a test result for (1). Most combination methods require the tests to be combined being independent under \(H_0\) which holds in the case under consideration. In the specific case of Fisher’s method, the same approach, but applied in a slightly different way, can be found in Perng and Littell (1976).

In Sect. 4, power of the different tests is compared empirically. The ability of each method to correctly detect the alternative differs with respect to whether there is a difference in expectation, variance, or both. Loughin (2004) compares the method of combining the *p* values without regard to a specific testing problem. However, it is instructive to apply these methods directly to the problem at hand and compare them with the likelihood-ratio tests in Sect. 2.

Situations where one is interested in differences in variability as well as in means can be found almost everywhere. A long list of such applications is compiled in Gastwirth et al. (2009). We discuss in Sect. 5 several examples from two subject areas, namely engineering and psychology. In particular, we consider several data sets arising from mental or intelligence tests since there is a large body of research supporting the existence of sex differences in specific cognitive abilities, some favouring men, some favouring women, sometimes differences are found in mean scores, or in variability, or in both.

## 2 The likelihood ratio test

In the following, \(F^{-1}_{\chi ^2_2}(p)\) denotes the *p*-quantile of the \(\chi ^2\)-distribution with 2 degrees of freedom. Typically, fairly large sample sizes are needed before these asymptotic results can be used for finite samples. However, several approaches are available to transform the test statistic or to determine a more exact distribution in order to improve the finite sample behaviour.

Pearson and Neyman (1930) directly considered \(\Lambda _{m,n}\), showing that under \(H_0\), the limiting distribution is the uniform distribution *U*(0, 1) [note that, if *Z* is uniformly distributed, \(-2\log Z\) is exponentially distributed with mean 2, or \(\chi ^2_2\)-distributed; hence, this result is in agreement with (2)]. They proposed to approximate the exact distribution of \(\Lambda _{m,n}\) for finite *n* and *m* by a beta distribution matching the first two moments.

Muirhead (1982) considered an asymptotic expansion of the distribution of the likelihood ratio test statistic under multivariate normality; in the univariate case, we obtain the following corollary.

**Corollary 2.1**

Comparison of the \(\chi ^2\)- and Muirhead-approximation for \(m=10\) and \(n=20\)

| \(p\) | \(F^{-1}_{\chi ^2_2}(p)\) | \(F^{-1}_{10,20}(p)\) | Muirhead approximation | |
|---|---|---|---|---|
| 0.75 | 2.77 | 3.13 | 2.70 | 2.79 |
| 0.90 | 4.61 | 5.19 | 4.46 | 4.64 |
| 0.95 | 5.99 | 6.74 | 5.77 | 6.02 |
| 0.99 | 9.21 | 10.33 | 8.74 | 9.22 |
| 0.999 | 13.82 | 15.48 | 12.78 | 13.83 |

*Remark*

In many cases, likelihood ratio tests exhibit some kind of optimality. Hsieh (1979) has shown that for the testing problem under consideration, the likelihood ratio test is asymptotically optimal in the sense of Bahadur efficiency.

## 3 Combination of *p* values

The second approach is to combine the *p* values of Student's *t* test and the *F* test for equality of variances. For this purpose, the hypothesis \(H_0\) has to be rephrased as a multiple one. Using the hypothesis and alternative of the *t* test,

\[ H_0^{\prime }:\ \mu =\nu \quad \text{against}\quad H_1^{\prime }:\ \mu \ne \nu , \]

and of the *F* test,

\[ H_0^{\prime \prime }:\ \sigma ^2=\tau ^2\quad \text{against}\quad H_1^{\prime \prime }:\ \sigma ^2\ne \tau ^2, \]

we have \(H_0=H_0^{\prime }\cap H_0^{\prime \prime }\). The *t* test statistic \(T\) follows a *t*-distribution with \(m+n-2\) degrees of freedom under \(H_0^{\prime }\), while the *F* test statistic \(Q=S_X^2/S_Y^2\) follows an \(F\)-distribution with \(m-1\) and \(n-1\) degrees of freedom under \(H_0^{\prime \prime }\). Methods for combining *p* values like Fisher's method (Fisher 1932) presume the independence of the corresponding *p* values under the hypothesis. To prove the independence of \(T\) and \(Q\) under \(H_0\), one can invoke Basu's theorem to prove the independence of the sum and the quotient of two independent \(\chi ^2\)-distributed random variables (Lehmann and Romano 2005, pp. 152–153). This result is applied to \((m-1)S^2_X/\sigma ^2\) and \((n-1)S^2_Y/\sigma ^2\) to derive the independence of \(Q\) and the pooled variance \(S^2\) under \(H_0\). Since \(S_X^2\), \(S_Y^2\), \(\bar{X}\) and \(\bar{Y}\) are independent, \(T\) and \(Q\) are independent as well. Having shown the independence of the *t* and *F* test statistics, we can combine the corresponding *p* values \(G_1(T)\) and \(G_2(Q)\), where \(G_1\) denotes the *p* value function of the two-sided *t* test and \(G_2\) that of the usual two-sided *F* test with equal tail probabilities.

The crucial fact behind the combination methods proposed in the literature is the following: if the distribution of the test statistic under \(H_0\) is unique and continuous, then the *p* value, considered as a random variable, follows the uniform distribution on the unit interval under \(H_0\) (this fact is often stated only for simple null hypotheses, which is overly restrictive).
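This probability integral transform is easy to verify by simulation. The sketch below is our own illustration, not taken from the paper; it uses an Exp(1)-distributed statistic, whose survival function \(\exp (-t)\) is available in closed form.

```python
import math
import random

random.seed(1)
# If the statistic T has a continuous null distribution with survival
# function S, then the p value S(T) is uniform on (0, 1) under H0.
# Illustration with T ~ Exp(1), where S(t) = exp(-t).
pvals = [math.exp(-random.expovariate(1.0)) for _ in range(100_000)]
mean_p = sum(pvals) / len(pvals)                     # should be close to 1/2
level = sum(p <= 0.05 for p in pvals) / len(pvals)   # should be close to 0.05
print(round(mean_p, 3), round(level, 3))
```

The empirical mean of the simulated *p* values is close to 1/2, and the fraction below 0.05 is close to the nominal level, as uniformity demands.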

### 3.1 Combination method due to Fisher

Fisher's method combines the *p* values via the statistic \(M_1=-2\left( \log G_1(T)+\log G_2(Q)\right) \), which is \(\chi ^2_4\)-distributed under \(H_0\); large values of \(M_1\) lead to rejection. For the testing problem given in (1), Perng and Littell (1976) proved that the test based on \(M_1\), just as the likelihood ratio test in Sect. 2, is asymptotically optimal in the sense of Bahadur efficiency (see also Singh 1986).

This is not too astonishing given the close relation between both tests. Fisher's method combines both tests as a product with equal weights. On the other hand, as shown by Pearson and Neyman (1930), the squared *t* test statistic and the *F* statistic are in one-to-one correspondence with the likelihood ratio statistics for testing \(H^{\prime }_0\) against \(H^{\prime }_1\) and \(H^{\prime \prime }_0\) against \(H^{\prime \prime }_1\), respectively. They showed further that the likelihood ratio for testing \(H_0\) against \(H_1\) can be expressed as the product of the likelihood ratio for testing \(H^{\prime }_0\) against \(H^{\prime }_1\) and the likelihood ratio for testing \(H^{\prime \prime }_0\) against \(H^{\prime \prime }_1\). Hence, the likelihood ratio test also combines both tests as a product, with approximately equal weights under \(H_0\) (since the factors have the same limiting distribution under \(H_0\)).
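For two *p* values, Fisher's statistic can be evaluated without any distribution tables, since the \(\chi ^2_4\) survival function has the closed form \((1+x/2)e^{-x/2}\). A minimal sketch (the function name is ours):

```python
import math

def fisher_combine(p1, p2):
    # Fisher's method: M1 = -2(log p1 + log p2) is chi^2_4-distributed
    # under H0 for independent p values; the chi^2_4 survival function
    # is (1 + x/2) * exp(-x/2).
    m1 = -2.0 * (math.log(p1) + math.log(p2))
    return (1.0 + m1 / 2.0) * math.exp(-m1 / 2.0)
```

Two individually borderline *p* values of 0.05 combine to roughly 0.017, i.e. the pooled evidence is stronger than either component alone.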

### 3.2 Minimum combination method and Bonferroni correction
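The minimum method rejects for small values of \(M_2=\min (G_1(T),G_2(Q))\). A sketch under the independence shown above (function names are ours): the exact version uses the distribution function \(t\mapsto 1-(1-t)^2\) of the minimum of two independent uniform variables, while the Bonferroni version simply doubles the smaller *p* value and is slightly conservative but valid without independence.

```python
def min_combine(p1, p2):
    # Exact (Tippett) version for two independent p values:
    # P(min <= t) = 1 - (1 - t)^2 under H0.
    return 1.0 - (1.0 - min(p1, p2)) ** 2

def bonferroni_combine(p1, p2):
    # Bonferroni version: conservative, needs no independence.
    return min(1.0, 2.0 * min(p1, p2))
```

For small *p* values the two versions nearly coincide: a minimum of 0.025 yields a combined *p* value of 0.049375 exactly and 0.05 by Bonferroni.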

### 3.3 Maximum and sum combination methods

The maximum and sum methods combine the *p* values in the opposite spirit. Using the maximum statistic \(M_3=\max (G_1(T),G_2(Q))\), whose distribution function under \(H_0\) is \(t\mapsto t^2\) on \([0,1]\), the corresponding test of level \(\alpha \) rejects \(H_0\) if \(M_3\le \sqrt{\alpha }\).
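Under independence, the maximum of two uniform *p* values has distribution function \(t\mapsto t^2\), so the test can equivalently be carried out via a combined *p* value; a one-line sketch (function name ours):

```python
def max_combine(p1, p2):
    # Maximum combination: P(max <= t) = t^2 under H0 for two
    # independent uniform p values, so the combined p value is max^2.
    return max(p1, p2) ** 2
```

The method only rejects when both component *p* values are small, which explains its weak power against alternatives affecting a single parameter.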

### 3.4 Combination methods due to Stouffer–Lipták and due to Mudholkar and George

The logit statistic has the same exact Bahadur slope under \(H_0\) as Fisher’s statistic. Hence, it is also optimal in the sense of Bahadur efficiency (Mudholkar and George 1979; Berk and Cohen 1979).

A known drawback of such methods is that they may fail to reject whenever one *p* value is large, regardless of how small the other is. Still, these combination methods can be useful in circumstances where both *p* values are roughly equally significant. Thus, it is crucial to compare the tests empirically under different alternatives.

It should be noted that the presented combination rules can also be used in other situations, for example for combining dependent tests. A description and comparison of combination rules in a nonparametric context can be found in Pesarin and Salmaso (2010, pp. 128–134).

## 4 Empirical level and power of the tests

In the following, the tests are compared at level \(\alpha =0.05\), whereby the exact 0.95-quantile of \(M_6\) was used as well as the exact 0.95-quantile of \(\Lambda _{m,n}\), calculated by (5), which is 6.93032 for \(m=n=10\), 6.430465 for \(m=n=20\) and 6.657326 for \(m=10,~n=30\). The values for \(m=n\) coincide with the values given in Nagar and Gupta (2004). To achieve such a high accuracy, we used very high numbers of quadrature points in evaluating the double integral in (5) by Gauss–Legendre quadrature together with an extrapolation method.

To assess the empirical level and power, the *p* values \(G_1\) and \(G_2\) were simulated \(10^5\) times with sample sizes of \(m=n=20\) and fixed parameters \(\sigma ^2=\tau ^2=1,~\mu =0\). Table 2 shows the empirical power of the tests for varying expectation \(\nu \). Clearly, the power should depend only on the absolute value of \(\nu \), which is the case within the simulation accuracy.

Empirical power for varying \(\nu \), \(\sigma ^2=\tau ^2=1, \mu =0\) and \(m=n=20\)

| \(\nu \) | Fisher | Minimum | Maximum | Edington | Stouffer | Mudholkar | LQ-test |
|---|---|---|---|---|---|---|---|
| \(-1.5\) | 0.986 | 0.991 | 0.223 | 0.314 | 0.903 | 0.962 | 0.988 |
| \(-1\) | 0.769 | 0.798 | 0.216 | 0.291 | 0.613 | 0.698 | 0.781 |
| \(-0.5\) | 0.252 | 0.258 | 0.143 | 0.164 | 0.215 | 0.233 | 0.256 |
| 0 | 0.050 | 0.050 | 0.050 | 0.050 | 0.050 | 0.050 | 0.050 |
| 0.5 | 0.251 | 0.257 | 0.142 | 0.163 | 0.215 | 0.232 | 0.255 |
| 1 | 0.770 | 0.799 | 0.218 | 0.292 | 0.618 | 0.701 | 0.782 |
| 1.5 | 0.986 | 0.990 | 0.222 | 0.313 | 0.903 | 0.962 | 0.988 |
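Power entries of this kind can be reproduced approximately by direct simulation. The sketch below is our own illustration; it simulates the likelihood ratio test with the exact critical value 6.430465 for \(m=n=20\) quoted above, so the estimate for \(\nu =1\) should land near the tabulated 0.782.

```python
import math
import random

random.seed(2)

def lrt_stat(x, y):
    # -2 log Lambda for H0: identical normal distributions (MLE variances).
    m, n = len(x), len(y)
    mean_x, mean_y = sum(x) / m, sum(y) / n
    grand = (sum(x) + sum(y)) / (m + n)
    v_x = sum((v - mean_x) ** 2 for v in x) / m
    v_y = sum((v - mean_y) ** 2 for v in y) / n
    v_0 = (sum((v - grand) ** 2 for v in x)
           + sum((v - grand) ** 2 for v in y)) / (m + n)
    return (m + n) * math.log(v_0) - m * math.log(v_x) - n * math.log(v_y)

def empirical_power(nu, tau, reps=10_000, m=20, n=20, crit=6.430465):
    # Fraction of rejections at the exact 0.95-critical value for m = n = 20.
    rejections = 0
    for _ in range(reps):
        x = [random.gauss(0.0, 1.0) for _ in range(m)]
        y = [random.gauss(nu, tau) for _ in range(n)]
        rejections += lrt_stat(x, y) > crit
    return rejections / reps

print(round(empirical_power(1.0, 1.0), 3))
```

With \(10^4\) replications the Monte Carlo error is of the order 0.005, so two-digit agreement with the table is expected.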

Note that Student's *t* test is not a level-\(\alpha \) test if the variances differ. However, it is well known that the *t* test is quite robust against the violation of homogeneity of variances; a quantitative statement in this direction can be found in Perng and Littell (1976, p. 970).

Empirical power for varying \(\tau \), \(\sigma ^2=1, \mu =\nu =0\) and \(m=n=20\)

| \(\tau \) | Fisher | Minimum | Maximum | Edington | Stouffer | Mudholkar | LQ-test |
|---|---|---|---|---|---|---|---|
| 0.2 | 1.000 | 1.000 | 0.230 | 0.321 | 0.998 | 1.000 | 1.000 |
| 0.6 | 0.459 | 0.480 | 0.180 | 0.224 | 0.365 | 0.411 | 0.473 |
| 1 | 0.050 | 0.050 | 0.049 | 0.050 | 0.049 | 0.049 | 0.050 |
| 1.4 | 0.216 | 0.222 | 0.126 | 0.142 | 0.183 | 0.198 | 0.221 |
| 1.8 | 0.585 | 0.609 | 0.198 | 0.253 | 0.459 | 0.523 | 0.600 |
| 2.2 | 0.853 | 0.870 | 0.223 | 0.301 | 0.701 | 0.789 | 0.864 |
| 2.6 | 0.959 | 0.966 | 0.226 | 0.313 | 0.849 | 0.921 | 0.964 |

Average *p* values for different choices of \(\nu \) and \(\tau \)

| No. | \(\nu \) | \(\tau \) | \(\bar{G}_1\) | \(\bar{G}_2\) |
|---|---|---|---|---|
| 1 | 0.000 | 1.00 | 0.50 | 0.50 |
| 2 | 0.075 | 1.05 | 0.49 | 0.49 |
| 3 | 0.150 | 1.10 | 0.47 | 0.48 |
| 4 | 0.225 | 1.15 | 0.44 | 0.45 |
| 5 | 0.300 | 1.20 | 0.40 | 0.41 |
| 6 | 0.375 | 1.25 | 0.36 | 0.37 |
| 7 | 0.450 | 1.30 | 0.31 | 0.33 |
| 8 | 0.525 | 1.35 | 0.27 | 0.30 |
| 9 | 0.600 | 1.40 | 0.24 | 0.26 |
| 10 | 0.675 | 1.45 | 0.20 | 0.23 |
| 11 | 0.750 | 1.50 | 0.17 | 0.20 |
| 12 | 0.825 | 1.55 | 0.15 | 0.17 |
| 13 | 0.900 | 1.60 | 0.13 | 0.14 |
| 14 | 0.975 | 1.65 | 0.11 | 0.12 |
| 15 | 1.050 | 1.70 | 0.09 | 0.10 |
| 16 | 1.125 | 1.75 | 0.08 | 0.09 |
| 17 | 1.200 | 1.80 | 0.07 | 0.07 |

The parameter combinations were chosen such that the *p* values of the *t* and the *F* test are nearly equal on average (see Table 4; Fig. 4). To this end, the statistics \(T\) and \(Q\) were simulated \(10^5\) times in order to estimate \(E_\vartheta (G_1)\) and \(E_\vartheta (G_2)\) by their arithmetic means.

Empirical power for varying \(\nu \) and \(\tau \), \(\mu =0, \sigma ^2=1\) and \(m=n=20\)

| No. | Fisher | Minimum | Maximum | Edington | Stouffer | Mudholkar | LQ-test |
|---|---|---|---|---|---|---|---|
| 1 | 0.050 | 0.050 | 0.049 | 0.049 | 0.049 | 0.049 | 0.049 |
| 3 | 0.076 | 0.074 | 0.069 | 0.071 | 0.074 | 0.075 | 0.076 |
| 5 | 0.159 | 0.148 | 0.129 | 0.139 | 0.153 | 0.156 | 0.158 |
| 7 | 0.294 | 0.260 | 0.234 | 0.257 | 0.287 | 0.293 | 0.291 |
| 9 | 0.455 | 0.393 | 0.364 | 0.401 | 0.449 | 0.454 | 0.448 |
| 11 | 0.621 | 0.539 | 0.509 | 0.555 | 0.617 | 0.624 | 0.612 |
| 13 | 0.758 | 0.671 | 0.637 | 0.689 | 0.757 | 0.763 | 0.749 |
| 15 | 0.859 | 0.783 | 0.742 | 0.791 | 0.855 | 0.862 | 0.852 |
| 17 | 0.922 | 0.864 | 0.822 | 0.864 | 0.921 | 0.925 | 0.917 |

We reran all simulations with smaller sample sizes \(m=n=10\). Apart from generally lower power, all of the aforementioned conclusions remain unchanged. Further, we reran all simulations with unequal sample sizes, namely \(m=10\) and \(n=30\), hence maintaining the total sample size. Under this scenario the empirical power decreases under every alternative compared to the balanced case \(m=n=20\), and it declines more strongly under a changing \(\tau \) than under a changing \(\nu \). Given the alternative that both parameters change, the LQ-test performs best by a considerable margin and the minimum method worst.

*Remark*

The statements about the asymptotic optimality of the likelihood ratio test or the combination method of Fisher are at first sight surprising. Consider, for example, the first simulation scenario with a difference between means but equal variances. Then, the optimality property of, say, Fisher’s method means that we lose no power in a specific asymptotic sense if we use this method instead of the *t* test which is well-known to be optimal in this situation for any finite sample size.

Relative efficiency of a sequence of tests \(\{S_n\}\) with respect to another sequence \(\{T_n\}\) is the ratio of the sample sizes necessary for \(\{S_n\}\) and \(\{T_n\}\) in order to attain the power \(\beta \) under the level \(\alpha \) for a specific alternative. Bahadur asymptotic relative efficiency considers the limit of this ratio for a sequence of levels decreasing to zero keeping \(\beta \) and the alternative fixed.

To compute relative efficiencies as an example, we fix power \(\beta =0.6\) and mean \(\nu = 0.5\), and consider four decreasing values of \(\alpha \), namely \(0.187, 0.027, 0.0012, 3\cdot 10^{-6}\). For the balanced case \(m=n\), the *t* test needs sample sizes 20, 50, 100 and 200 to reach power \(\beta =0.6\), whereas the combination method of Fisher needs sample sizes 27, 64, 122 and 230 for the same power. Hence, the relative efficiencies for the four levels are 0.74, 0.78, 0.82 and 0.87. Indeed, the relative efficiency increases, but even for large sample sizes (and, correspondingly, small levels) it is far away from 1, the limit for \(\alpha \rightarrow 0\) given by theory.
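The quoted efficiencies are simply the ratios of the required sample sizes; the snippet below is our own check of that arithmetic.

```python
# Sample sizes needed for power 0.6 at the four decreasing levels
# (taken from the text): t test vs. Fisher's combination method.
t_sizes = [20, 50, 100, 200]
fisher_sizes = [27, 64, 122, 230]
efficiencies = [round(t / f, 2) for t, f in zip(t_sizes, fisher_sizes)]
print(efficiencies)  # relative efficiencies approaching, but far from, 1
```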

## 5 Data examples

We applied the tests to several data sets exemplifying the behaviour of the presented methods.

*Example 1*

The resulting *p* values are given in the second line of Table 6; there, we added the usual significance codes to the *p* values for a quick overview: \(p^{***}\) if \(p\le 0.001\), \(p^{**}\) if \(0.001<p\le 0.01\), \(p^{*}\) if \(0.01<p\le 0.05\), and \(p^{\circ }\) if \(0.05<p\le 0.10\).

*p* values for the different testing methods applied to examples 1–3

| Ex. | Fisher | Minimum | Maximum | Edington | Stouffer | Mudholkar | LQ-test |
|---|---|---|---|---|---|---|---|
| 1 | \(0.0058^{**}\) | \(0.030^{*}\) | \(0.0021^{**}\) | \(0.0019^{**}\) | \(0.0033^{**}\) | \(0.0045^{**}\) | \(0.0072^{**}\) |
| 2a | \(0.0007^{***}\) | \(0.0017^{**}\) | \(0.0059^{**}\) | \(0.0030^{**}\) | \(0.0006^{***}\) | \(0.0006^{***}\) | \(0.0008^{***}\) |
| 2b | 0.11 | \(0.045^{*}\) | 1 | 0.52 | 1 | 1 | \(0.074^{\circ }\) |
| 3 | \(0.0098^{**}\) | \(0.040^{*}\) | \(0.0039^{**}\) | \(0.0034^{**}\) | \(0.0057^{**}\) | \(0.0075^{**}\) | \(0.012^{*}\) |

For the data at hand, the *p* value of the minimum combination method is considerably larger than for all other methods, whereas all remaining *p* values are comparable and smaller than 0.01. Since the tests find a significant difference between the two underlying distributions, one could informally proceed as proposed in Sect. 3 of Zhang et al. (2012), first applying the *F* test (\(p=0.015\)) followed by Welch's *t* test [\(p=0.050\); cf. Bickel and Doksum (2006, p. 264) or Aspin and Welch (1949)]. Clearly, one has to be careful when formally reporting the results of such follow-up tests. The results of this example are in agreement with the simulation results: the minimum method has comparably low power when there are differences in the means as well as in the variances.

*Example 2a,b*

The *p* values for the tests for equality of distributions are given in line 3 of Table 6. Here, the values for the maximum, sum and minimum method are larger than the remaining values, but all tests yield a significant result on the 0.01-level. Since there is a significant difference between the two distributions, we applied the two-sample *t* test (\(p=0.077\)) and the *F* test (\(p=0.0008\)), indicating a significantly larger variability in the results of males compared to females.

For example 2b, the *p* values of the tests can be found in line 4 of Table 6. Here, only the minimum combination method shows a significant result on the 0.05-level, followed by the LQ-test and Fisher's combination method with *p* values of 0.074 and 0.11. The remaining *p* values are larger than 0.5. Here, the *F* test yields a *p* value of 0.023. Clearly, the minimum method is not affected by the large *p* value of the *t* test and hence performs best in this example.

It should be noted that in this and the following example we are dealing with large samples, where very small differences can be statistically significant yet of little scientific or practical importance.

*Example 3*

The results in line 5 of Table 6 show that, as in example 1, the *p* value of the minimum method is considerably larger than for all other methods, followed by the LQ-test and Fisher's method with *p* values around 0.01. Since there are significant differences between the underlying distributions, we applied the two-sample *t* test (\(p=0.062\)) and the *F* test (\(p=0.020\)), again indicating a larger variability in the test scores of male students. In this example, it is clearly noticeable that the *p* values of all combination tests except the minimum method can be much smaller than the *p* values of the individual *t* and *F* tests.

## 6 Discussion

To sum up the results of the simulation study, it is clear that the maximum and the sum combination methods should not be used due to their inability to detect many alternatives. From Fig. 1, one could expect that both methods might be useful if differences in location come along with differences in variability, which is the rule rather than the exception in many biometrical applications. However, the simulations show that these methods are inferior to other combination methods even in the case of location-scale differences. Furthermore, the minimum combination method is not really recommendable due to its comparably low power when there are differences in the means as well as in the variances. There is not much to choose between the remaining tests. In terms of power, the likelihood ratio test has the edge over the other methods, at least in unbalanced situations, whereas the Fisher combination method stands out due to its simplicity.

Even if the data examples corroborate the previous findings, the performance of the different tests for specific data sets may be astonishing. As always in such situations, there is a danger that one performs several tests, and chooses a specific one afterwards for reporting.

Like the *F* test for the homogeneity of variances, all tests previously described are sensitive to the assumption that the data are drawn from underlying Gaussian distributions. This assumption should be checked by diagnostic plots. There are various more robust (and less efficient) competitors to the *F* test available (see, e.g., Marozzi 2011), but combining these tests with tests for equality of location is not straightforward since the test statistics are not independent. The same holds for combinations of nonparametric tests like the Lepage test (Lepage 1971; Marozzi 2013). There also exist nonparametric location-scale tests like the Cucconi rank test (Cucconi 1968) which are not combination tests. Marozzi (2009) shows that the Cucconi test is a powerful alternative to the Lepage test and suggests carrying it out as a permutation test. Clearly, such tests can be preferable in specific applications.

It is certainly possible to cover one or more of the presented methods in the classroom. At least, it should be made clear that the combination of the *t* and the *F* test using a Bonferroni correction leads to a valid test for \(H_0\) against \(H_1\) at level \(\alpha \). If one accepts (or proves) the independence of the tests, it is possible to discuss the more refined combination methods. Such a treatment accentuates the randomness of *p* values, an important fact which is often obscured in the classroom (Murdoch et al. 2008). Determining the likelihood ratio statistic in (2) is a worthwhile exercise, while a more or less sophisticated implementation of the likelihood ratio test of Muirhead in Corollary 2.1 is an interesting task for an accompanying statistical computing lab.

One caveat: strictly speaking, none of the tests is a diagnostic test, insofar as it is not possible to deduce differences in means or variances from a rejection of the overall hypothesis (this would be possible using Welch's *t* test and the *F* test with a Bonferroni correction). However, nothing speaks against an informal approach as in Sect. 3 of Zhang et al. (2012). Since the minimum method corresponds to the Bonferroni correction, and since the *t* test is robust against violations of variance homogeneity, the minimum method comes closest to a diagnostic test.

## Notes

### Acknowledgments

The authors thank the Editor and two anonymous referees for their valuable comments on the original version of the manuscript.

### References

- Aspin AA, Welch BL (1949) Tables for use in comparisons whose accuracy involves two variances, separately estimated. Biometrika 36:290–296
- Berk RH, Cohen A (1979) Asymptotically optimal methods of combining tests. J Am Stat Assoc 74:812–814
- Bickel PJ, Doksum KA (2006) Mathematical statistics, basic ideas and selected topics, 2nd edn, vol 1. Pearson, London
- Cucconi O (1968) Un nuovo test non parametrico per il confronto tra due gruppi campionari. Giornale degli Economisti XXVII:225–248
- Deary IJ, Irwing P, Der G, Bates TC (2007) Brother–sister differences in the \(g\) factor in intelligence: analysis of full, opposite-sex siblings from the NLSY1979. Intelligence 35:451–456
- Edington ES (1972) An additive method for combining probability values from independent experiments. J Psychol 80:351–363
- Fisher RA (1932) Statistical methods for research workers, 4th edn. Oliver & Boyd, Edinburgh
- Gastwirth JL, Gel YR, Miao W (2009) The impact of Levene's test of equality of variances on statistical theory and practice. Stat Sci 24:343–360
- George EO, Mudholkar GS (1983) On the convolution of logistic random variables. Metrika 30:1–13
- Hogg RV, McKean JW, Craig AT (2005) Introduction to mathematical statistics, 6th edn. Pearson Education, London
- Hsieh HK (1979) On asymptotic optimality of likelihood ratio tests for multivariate normal distributions. Ann Stat 7:592–598
- Jain SK, Rathie PN, Shah MC (1975) The exact distributions of certain likelihood ratio criteria. Sankhya Ser A 37:150–163
- Lehmann EL, Romano JP (2005) Testing statistical hypotheses, 3rd edn. Springer, Berlin
- Lepage Y (1971) A combination of Wilcoxon's and Ansari–Bradley's statistics. Biometrika 58:213–217
- Lipták T (1958) On the combination of independent tests. Magyar Tudományos Akadémia Matematikai Kuatató Intezetenek Kozlemenyei 3:1971–1977
- Loughin TM (2004) A systematic comparison of methods for combining \(p\)-values from independent tests. Comput Stat Data Anal 47:467–485
- Marozzi M (2009) Some notes on the location-scale Cucconi test. J Nonparametric Stat 21:629–647
- Marozzi M (2011) Levene type tests for the ratio of two scales. J Stat Comput Simul 81:815–826
- Marozzi M (2012) A combined test for differences in scale based on the interquantile range. Stat Papers 53:61–72
- Marozzi M (2013) Nonparametric simultaneous tests for location and scale testing: a comparison of several methods. Commun Stat Simul Comput 42:1298–1317
- Mudholkar GS, George EO (1979) The logit statistic for combining probabilities. In: Rustagi J (ed) Symposium on optimizing methods in statistics. Academic Press, New York, pp 345–366
- Muirhead RJ (1982) On the distribution of the likelihood ratio test of equality of normal populations. Can J Stat 10:59–62
- Murdoch DJ, Tsai Y, Adcock J (2008) P-values are random variables. Am Stat 62:242–245
- Nagar DK, Gupta AK (2004) Percentage points for testing homogeneity of several univariate Gaussian populations. Appl Math Comput 156:551–561
- Nair VN (1984) On the behaviour of some estimators from probability plots. J Am Stat Assoc 79:823–830
- Pearson ES, Neyman J (1930) On the problem of two samples. In: Neyman J, Pearson ES (eds) Joint statistical papers. Cambridge University Press, Cambridge, pp 99–115 (1967)
- Perng SK, Littell RC (1976) A test of equality of two normal population means and variances. J Am Stat Assoc 71:968–971
- Pesarin F, Salmaso L (2010) Permutation tests for complex data: theory, applications and software. Wiley, New York
- Shoemaker LH (1999) Interquantile tests for dispersion in skewed distributions. Commun Stat Simul Comput 28:189–205
- Singh N (1986) A simple and asymptotically optimal test for the equality of normal populations: a pragmatic approach to one-way classification. J Am Stat Assoc 81:703–704
- Steinmayr R, Beauducel A, Spinath B (2010) Do sex differences in a faceted model of fluid and crystallized intelligence depend on the method applied? Intelligence 38:101–110
- Stouffer S, Suchman E, DeVinney L, Star S, Williams R (1949) The American soldier, vol I: adjustment during army life. Princeton University Press, Princeton
- Tippett LHC (1931) The method of statistics. Williams and Norgate, London
- Zhang L, Xu X, Chen G (2012) The exact likelihood ratio test for equality of two normal populations. Am Stat 66:180–184