Testing normality of a large number of populations

This paper studies the problem of simultaneously testing that each of k independent samples comes from a normal population. The means and variances of those populations may differ. The proposed procedures are based on the BHEP test and they allow k to increase; k can even be larger than the sample sizes.


Introduction
Testing for normality is a topic of interest that has generated, and is still generating, a vast literature. Some recent contributions are Ebner et al. (2022), Henze and Jiménez-Gamero (2019), Henze et al. (2019), Henze and Koch (2020), and Jelito and Pitera (2021); see the paper by Ebner and Henze (2020) for a review of normality tests. Most papers on this issue deal with testing normality for a single sample, and the properties of the proposed procedures are stated as the sample size increases. This paper studies the problem of simultaneously testing normality of k univariate samples, where k can increase with the sample sizes. Moreover, k will be allowed to be even larger than the sample sizes. Specifically, we will consider the following general setting: let X_1 = {X_{1,1}, ..., X_{1,n_1}}, ..., X_k = {X_{k,1}, ..., X_{k,n_k}} be k independent samples with sizes n_1, ..., n_k, which may be different, coming from X_1, ..., X_k ∈ R, with continuous distribution functions F_1, ..., F_k, respectively, and E(X_j^2) < ∞, 1 ≤ j ≤ k. The testing problem is

H_0 : F_1, ..., F_k ∈ N, (1)

where N is the set of univariate normal populations, N = {N(μ, σ²) : μ ∈ R, σ > 0}, and, as said before, k is allowed to be large (the precise meaning of "large" will be stated in the following sections).
The main motivation for testing normality comes from the fact that, under this distributional assumption, many statistical procedures become simpler and more efficient than their nonparametric counterparts, mainly due to the good properties of the normal law. Nevertheless, the efficiency of those procedures may decrease, or even disappear, if the normality assumption fails. For example, if one can assume that the populations are normal, then the classical k-sample problem becomes that of testing the equality of variances and the equality of means, for which more tests can be found in the statistical literature than for testing the equality of the k distributions when k is large (Zhan and Hart 2014; Jiménez-Gamero et al. 2022). As another instance, let us consider the problem of testing the equality of the means of a large number k of univariate normal populations that may have different variances. Park and Park (2012) proposed two tests for this problem, whose associated statistics, conveniently normalized, are asymptotically normal. Here asymptotic means as k → ∞.
The assumption that all of the populations have a normal distribution is crucial in order to derive the asymptotic distribution of those test statistics. In fact, the simulations in Jiménez-Gamero and Franco-Pereira (2021) show that, when the data meet the normality assumption, these tests can be more powerful than nonparametric competitors but, when the data come from non-normal populations, the empirical type I errors of the tests in Park and Park (2012) can be far from the nominal value.
The problem of simultaneously testing goodness-of-fit for k populations has been studied in Gaigall (2021) by using test statistics based on comparing the empirical distribution function of each sample with a parametric estimator derived under the null hypothesis. The asymptotic properties studied in Gaigall (2021) are for fixed k and increasing sample sizes. Jiménez-Gamero et al. (2005) studied the problem of testing normality of the errors in multivariate, homoscedastic linear models. The test statistic in Jiménez-Gamero et al. (2005) is based on comparing the empirical characteristic function (ECF) of the studentized residuals with the characteristic function (CF) of a standard normal law. The asymptotic properties studied in Jiménez-Gamero et al. (2005) allow k to increase with the sample sizes in such a way that k²/n = o(1), where n is the sample size. Specifically, they show that the asymptotic null distribution of the considered test statistic coincides with that derived for independent, identically distributed (iid) data in Baringhaus and Henze (1988) and Henze and Wagner (1997). The normality test in Baringhaus and Henze (1988) and Henze and Wagner (1997) is usually called the BHEP test, since it was first proposed for the univariate case by Epps and Pulley (1983), and then extended to the multivariate case by Baringhaus and Henze (1988). Moreover, due to its nice properties, it has been extended in several directions: to testing normality of the errors in homoscedastic linear models in Jiménez-Gamero et al. (2005), as explained before; to testing normality of the errors in nonparametric regression in Hušková and Meintanis (2010) and Rivas-Martínez and Jiménez-Gamero (2018); to testing normality of the innovations in GARCH models in Jiménez-Gamero (2014) and Klar et al. (2012); and to testing Gaussianity of random elements taking values in a Hilbert space in Henze and Jiménez-Gamero (2021), just to cite a few.
In this paper we first study the test in Jiménez-Gamero et al. (2005) for testing H_0, without assuming that the populations are homoscedastic. Not assuming homoscedasticity greatly complicates the theoretical derivations since, instead of estimating one variance from the pooled data, we must now deal with k variance estimators. With this aim, it will be assumed that the sample sizes are comparable in the following sense:

c_0 ≤ n_i/m ≤ C_0, ∀i, for some fixed constants c_0 and C_0, (2)

where m = min_{1≤j≤k} n_j. It is shown that the asymptotic null distribution of the test statistic also coincides with that derived for iid data whenever k/m = o(1). Notice that if the sample sizes satisfy (2), then the condition k/m → 0 is equivalent to k²/n → 0, with n = n_1 + ... + n_k.
Since the practical calculation of the BHEP test statistic involves O(n²) summands (see Section 2), its computation can be rather time-consuming for large k. So, for the case k/m → λ ∈ (0, ∞], we explore other strategies for testing H_0. First, inspired by the random projection procedure in Cuesta-Albertos et al. (2006), we could test H_0 using not all data sets but a "small" number k_0 of samples (small in the sense that k_0/m = o(1)) randomly selected from all of the k population samples. Second, we consider a test statistic which combines the BHEP test statistics calculated in each sample.
As said before, this paper studies BHEP-based statistics for testing H_0. Other test statistics could be considered; the main reason for our choice is the good properties enjoyed by the BHEP test. This is why Section 2 starts by reviewing its definition and some of its properties. That section also derives new properties that will be used in Section 5. Specifically, it is shown that the first two moments of the null distribution of the BHEP test statistic converge to those of the asymptotic null distribution, and sufficient conditions are given for such convergence to hold under alternatives. Section 3 studies the test that compares the ECF of the studentized data with the CF of the standard normal law, which, in our view, is the natural extension of the BHEP test statistic to the setting in (1). Section 4 studies the test of the previous section when it is calculated on a subset of randomly selected samples. Section 5 studies a test whose statistic is based on the sum of the BHEP test statistics calculated in each sample. The properties investigated in Sects. 3-5 are asymptotic. In order to assess the finite sample performance of the proposals, a simulation study was carried out, whose results are reported in Sect. 6. Section 7 summarizes the paper and comments on extensions and further research. All proofs are deferred to the last section.
Throughout the paper we will make use of the following standard notation: i = √−1 is the imaginary unit; for any complex number x = a + ib ∈ C, with a, b ∈ R, ℜx = a denotes its real part and ℑx = b its imaginary part; all random variables and random elements will be defined on a sufficiently rich probability space (Ω, A, P); the symbols E and V denote expectation and variance, respectively; P_0, E_0 and V_0 denote probability, expectation and variance under the null hypothesis, respectively; →_D means convergence in distribution of random vectors and random elements.

The test
This section revisits the BHEP test for univariate data. Let X_1, ..., X_n (n ≥ 2) be a sample from a random variable X with continuous distribution function F and E(X²) < ∞. For testing the hypothesis H_{0,1}: F ∈ N, the rationale of the BHEP test is as follows: write X̄ = n^{−1} Σ_{j=1}^n X_j and S² = n^{−1} Σ_{j=1}^n (X_j − X̄)² for the sample mean and the sample variance, respectively, and let Y_j = (X_j − X̄)/S, 1 ≤ j ≤ n, be the so-called scaled residuals of X_1, ..., X_n, which provide an empirical standardization of X_1, ..., X_n. Notice that, under the assumptions made, P(S > 0) = 1, and thus Y_1, ..., Y_n are well defined. Since, under H_{0,1} and for large n, the distribution of the scaled residuals should be close to the standard normal distribution, it is tempting to compare the ECF of Y_1, ..., Y_n, ϕ_n(t) = n^{−1} Σ_{j=1}^n exp(itY_j), with ϕ_0(t) = exp(−t²/2), which is the CF of the standard normal distribution. The BHEP test rejects H_{0,1} for large values of the weighted L²-statistic

T_{n,β} = ∫ |ϕ_n(t) − ϕ_0(t)|² w_β(t) dt, (3)

where an unspecified integral stands for an integral over the whole real line, w_β(t) is the probability density function of the normal distribution N(0, β²), and β > 0 is a parameter that must be fixed by the user. The test statistic T_{n,β} may be written as

T_{n,β} = (1/n²) Σ_{j,r=1}^{n} exp{−β²(Y_j − Y_r)²/2} − (2/(n√(1+β²))) Σ_{j=1}^{n} exp{−β²Y_j²/(2(1+β²))} + 1/√(1+2β²), (4)

which is a useful expression for the practical computation of T_{n,β}. Notice that the computation of T_{n,β} involves a double sum, so the number of required calculations is of order O(n²).
Representation (4) also shows that T_{n,β} is a function of the scaled residuals Y_1, ..., Y_n only, and thus it is invariant with respect to affine transformations of X_1, ..., X_n. This property implies that the null distribution of T_{n,β} only depends on the sample size n and on the value of β.
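For illustration, representation (4) can be evaluated directly with vectorized operations. The sketch below (the function name is ours) returns nT_{n,β}; the affine invariance noted above is easy to check numerically.

```python
import numpy as np

def bhep_statistic(x, beta=1.0):
    """n * T_{n,beta}: BHEP statistic computed from the scaled residuals of x."""
    x = np.asarray(x, dtype=float)
    n = x.size
    y = (x - x.mean()) / x.std()          # scaled residuals (S^2 uses the 1/n convention)
    d = y[:, None] - y[None, :]           # pairwise differences Y_j - Y_r
    b2 = beta * beta
    term1 = np.exp(-b2 * d**2 / 2).sum() / n
    term2 = 2.0 / np.sqrt(1.0 + b2) * np.exp(-b2 * y**2 / (2.0 * (1.0 + b2))).sum()
    term3 = n / np.sqrt(1.0 + 2.0 * b2)
    return term1 - term2 + term3
```

Because the statistic depends on the data only through the scaled residuals, `bhep_statistic(x)` and `bhep_statistic(3 * x + 5)` agree up to floating-point error.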
Critical points for several sample sizes, β = 1 and the usual values of the probability of type I error (level) can be found in Baringhaus and Henze (1988) and Henze (1990). The function cv.quan of the package mnt (Butsch and Ebner 2020) of the R language (R Core Team 2020) can be used to calculate critical points of the null distribution of nT_{n,β} for any sample size, any value of β and any level. The critical points can also be approximated by those of the asymptotic null distribution of nT_{n,β}. Under the null hypothesis, nT_{n,β} is asymptotically (as n → ∞) distributed as W_β = Σ_{j≥1} λ_{β,j} Z_j², where λ_{β,1}, λ_{β,2}, ... is the descending sequence of positive eigenvalues of a certain integral operator, and Z_1, Z_2, ... are independent standard normal random variables. Since those eigenvalues, and hence the cumulants of W_β, can be estimated (see, for example, Ebner and Henze 2021, 2022; Meintanis et al. 2022), one could approximate the asymptotic critical values by using the Pearson system of distributions (with the help of the package PearsonDS (Becker and Klößner 2022) of the R language (R Core Team 2020)). This idea was proposed by Henze (1990), who (exactly) calculated the first four cumulants of W_β with β = 1.
The BHEP test is consistent against any fixed alternative and it is able to detect continuous alternatives converging to the null at the rate n^{−1/2}; see Henze and Wagner (1997). Ebner and Henze (2021) and Meintanis et al. (2022) have obtained approximate Bahadur efficiencies of the BHEP test, showing that it outperforms tests based on the empirical distribution function over certain alternatives close to normality. All these properties are for n → ∞.
As in the first paragraph of this subsection, when β = 1 we write μ_0 and σ²_0 for μ_{0,β} and σ²_{0,β}, respectively. We have numerically checked the approximations μ_{0,n} ≈ μ_0 and τ²_{0,n} ≈ σ²_0 for β = 1 and finite sample sizes. For each n, the true values of μ_{0,n} and τ²_{0,n} were approximated by simulation, based on 100,000 samples of size n from a standard normal law: nT_{n,1} was computed for each sample, and then the sample mean and the sample variance of these 100,000 values were used to approximate μ_{0,n} and τ²_{0,n}, respectively. Figure 1 displays μ_{0,n}, together with the line y = μ_0 in red, and τ²_{0,n}, together with the line y = σ²_0 in red, for 5 ≤ n ≤ 100. Looking at this figure one can see that the approximation for the mean, μ_{0,n} ≈ μ_0, is almost exact for n ≥ 20, and the approximation for the variance, τ²_{0,n} ≈ σ²_0, works really well for n ≥ 50. Now assume that E(X⁴) < ∞. Ebner and Henze (2021) shows that, under fixed alternatives, √n(T_{n,β} − Δ_X) is asymptotically normal, where Δ_X denotes the a.s. limit of T_{n,β}; see also Baringhaus et al. (2017). Similar steps to those given in the proof of Proposition 1 show that a sufficient condition for the uniform integrability of the associated sequence is that

E(|X|^{4+δ}) < ∞ (5)

for some δ > 0.
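The Monte Carlo approximation of μ_{0,n} and τ²_{0,n} just described can be sketched as follows (the helper name is ours; far fewer replications than the 100,000 used for Figure 1, so the output is only a rough approximation):

```python
import numpy as np

def null_mean_var(n, beta=1.0, reps=2000, seed=12345):
    """Approximate mu_{0,n} and tau^2_{0,n}: the null mean and variance of n*T_{n,beta},
    estimated from `reps` standard normal samples of size n."""
    rng = np.random.default_rng(seed)
    b2 = beta * beta
    vals = np.empty(reps)
    for b in range(reps):
        x = rng.normal(size=n)
        y = (x - x.mean()) / x.std()      # scaled residuals
        d = y[:, None] - y[None, :]
        vals[b] = (np.exp(-b2 * d**2 / 2).sum() / n
                   - 2.0 / np.sqrt(1.0 + b2) * np.exp(-b2 * y**2 / (2.0 * (1.0 + b2))).sum()
                   + n / np.sqrt(1.0 + 2.0 * b2))
    return vals.mean(), vals.var()
```

With more replications, the returned pair stabilizes around (μ_{0,n}, τ²_{0,n}), which in turn approach (μ_0, σ²_0) as n grows.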

The BHEP test for H_0
The tests in Baringhaus and Henze (1988) and Epps and Pulley (1983) choose β = 1. In order to simplify notation, in our developments we will also take β = 1, although all results remain true for arbitrary (but fixed) β. Let Y_{j,r} = (X_{j,r} − X̄_j)/S_j, 1 ≤ r ≤ n_j, 1 ≤ j ≤ k, where X̄_j and S²_j stand for the sample mean and the sample variance of the sample from X_j, respectively, 1 ≤ j ≤ k. As in the one-sample case, under H_0 the distribution of the scaled residuals should be close to the standard normal distribution. So we consider the test statistic

T_{k,n} = ∫ |ϕ_{k,n}(t) − ϕ_0(t)|² w(t) dt, with ϕ_{k,n}(t) = (1/n) Σ_{j=1}^{k} Σ_{r=1}^{n_j} exp(itY_{j,r}) and n = n_1 + ... + n_k,

where w(t) = w_1(t) is the probability density function of the standard normal law. Notice that, to be more precise, the proposed test statistic should be denoted by T_{n_1,...,n_k}; to simplify notation, we just write T_{k,n}. Since T_{k,n} is invariant under affine transformations within each sample, the scaled residuals Y_{j,r} calculated from X_{j,1}, ..., X_{j,n_j} coincide with those calculated from W_{j,1}, ..., W_{j,n_j}, with W_{j,r} = (X_{j,r} − E(X_j))/V(X_j)^{1/2}. For this reason, we can assume that E(X_j) = 0 and V(X_j) = 1, 1 ≤ j ≤ k. Accordingly, from now on, instead of (1), it will be assumed that the setting is as follows:

X_1, ..., X_k are independent samples from continuous populations X_1, ..., X_k with E(X_j) = 0 and V(X_j) = 1, 1 ≤ j ≤ k. (6)

Another consequence of assuming that E(X_j) = 0 and V(X_j) = 1, 1 ≤ j ≤ k, is that the null distribution of T_{k,n} only depends on the sample sizes n_1, ..., n_k. As in the one-sample case, the null hypothesis H_0 is rejected for large values of T_{k,n}, say T_{k,n} > t_{k,n,α}, where t_{k,n,α} is the α upper percentile of the null distribution of T_{k,n}. So, to test H_0 we must calculate upper percentiles of the null distribution of T_{k,n}. Although the critical points can be calculated by simulation, from a practical point of view it would be nice if they could be approximated in some fashion. The next result is useful in that sense.
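Under our reading of the definition above (the ECF of all studentized observations, pooled, compared with ϕ_0), T_{k,n} can be computed by studentizing each sample separately and then applying the one-sample closed form to the pooled scaled residuals. All names below are ours.

```python
import numpy as np

def t_kn(samples, beta=1.0):
    """Pooled BHEP-type statistic T_{k,n}: studentize each sample, pool the
    scaled residuals, and compare their ECF with the standard normal CF."""
    y = np.concatenate([(x - x.mean()) / x.std()
                        for x in (np.asarray(s, dtype=float) for s in samples)])
    n = y.size
    b2 = beta * beta
    d = y[:, None] - y[None, :]
    return (np.exp(-b2 * d**2 / 2).sum() / n**2
            - 2.0 / (n * np.sqrt(1.0 + b2)) * np.exp(-b2 * y**2 / (2.0 * (1.0 + b2))).sum()
            + 1.0 / np.sqrt(1.0 + 2.0 * b2))
```

The double sum over all n pooled residuals is what makes the O(n²) cost visible in practice when k (and hence n) is large.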
Before stating it, we introduce some notation. Since w(t) = w(−t), the statistic T_{k,n} can be written as a squared norm, as follows. Let L²_w denote the separable Hilbert space of (equivalence classes of) measurable functions f : R → C satisfying ∫ |f(t)|² w(t) dt < ∞. The scalar product and the resulting norm in L²_w will be denoted by ⟨f, g⟩_w and ‖f‖_w, respectively; with this notation, T_{k,n} = ‖ϕ_{k,n} − ϕ_0‖²_w.

Theorem 1 Suppose that (6) holds, that H_0 is true, that the sample sizes satisfy (2), and that k/m → 0. Then nT_{k,n} converges in distribution to the same limit as the one-sample statistic nT_{1,n}.

Theorem 1 says that, under certain assumptions, the asymptotic null distribution of T_{k,n} coincides with that of the one-sample test statistic T_{1,n} (see Henze and Wagner 1997). Therefore, at least for large n, we can approximate the percentiles of T_{k,n} either by those of T_{1,n} or by those of the asymptotic null distribution of T_{1,n}. In both cases, as commented in Sect. 2, the percentiles can be calculated using packages of the R language.
Next we study the behavior of the test under alternatives. To this end, we first state a result (Theorem 2) that gives the a.s. behavior of T_{k,n}: under its assumptions, T_{k,n} − ‖ϕ_{0,k} − ϕ_0‖²_w → 0 a.s. Next we see that the power of the test goes to 1 for alternatives such that ‖ϕ_{0,k} − ϕ_0‖²_w > 0. With this aim, it will be assumed w.l.o.g. that X_1, ..., X_r have alternative distributions with CFs ϕ_1, ..., ϕ_r, while the other k − r populations obey H_0, for some 1 ≤ r ≤ k. (10) In the above setting, r is allowed to vary with m, r = r_m, but such dependence on m will be skipped. Assume that the CFs of the alternative distributions satisfy the separation condition (12), and that the sample sizes satisfy (2); then, from the bound in (11), inequality (13) follows, where M_1 and M_2 are two positive constants (depending on τ, c_0 and C_0). As a consequence of Theorem 1, Theorem 2 and (13), we have the following result.
The result in Corollary 1 remains true if t_{k,n,α} is replaced with a consistent estimator.
Remark 1 Assumption (12) may not be satisfied under alternatives. To see this fact, let us recall that if X has CF ϕ_X = ℜϕ_X + iℑϕ_X, then −X has CF ϕ_{−X} = ℜϕ_X − iℑϕ_X. On the other hand, if X is a continuous random variable with probability density function (pdf)

f(x) = 2φ(x)π(x), (14)

where φ(x) is the pdf of a standard normal law and π is a skewing function (i.e., a function satisfying 0 ≤ π(x) ≤ 1 and π(−x) = 1 − π(x)), then the real part of the CF of X is equal to ϕ_0 (see, e.g., Jiménez-Gamero et al. 2016), and therefore 0.5ϕ_X + 0.5ϕ_{−X} = ϕ_0. An example of a continuous law having a pdf of the form (14) is the skew normal law, for which π(x) = Φ(λx), where Φ denotes the cumulative distribution function of a standard normal law and λ ∈ R is a constant. Thus, if X_1 = X, X_2 = −X (say) and the pdf of X satisfies (14), then the infimum in (12) equals 0. The same problem arises in the one-sample case if the assumption that the data are identically distributed is dropped. If, as in the one-sample case, we assume that X_1, ..., X_r have the same alternative distribution, then (12) is satisfied.
Remark 2 Theorem 1 states that, under certain assumptions on k and the sample sizes, the asymptotic null distribution of nT_{k,n} coincides with that of the BHEP test statistic.
On the other hand, if X_1, ..., X_k all have the same CF, say ϕ, then we saw that (9) holds. As a consequence of these two facts, in this setting, the Bahadur efficiencies computed in Ebner and Henze (2021) and Meintanis et al. (2022) for the BHEP test also apply to the test proposed in this section.
Remark 3 As explained before, the main motivation for considering T_{k,n} is that it can be seen as the natural extension of the BHEP test statistic. Nevertheless, other test statistics can be used for testing H_0. For example, if we denote by T_i the BHEP test statistic calculated on the sample from X_i, then, following the approach in Gaigall (2021), other possible choices are the maximum and the sum of n_1T_1, ..., n_kT_k. As for the studied proposal, the null distribution of these two test statistics only depends on n_1, ..., n_k. We will come back to a test statistic of this type in Sect. 5.
The results in this section allow k to increase with the sample sizes, but at a lower rate. A key result in the proof of Theorem 1 is Lemma 2 in Sect. 8. If k/m → λ > 0 then, it can be checked, the results in Lemma 2 are no longer true. Moreover, since in such a case even the practical calculation of T_{k,n} can be very time-consuming, Sects. 4 and 5 explore other strategies to build a test of H_0.

Random selection
Because, as observed before, for large k the calculation of T_{k,n} can be very time-consuming, here we study a more efficient way (from a computational point of view) of testing H_0, which consists in randomly selecting a subset of samples and then applying the test studied in the previous section to the selected data. Specifically, the method proceeds as follows: for some (fixed) k_0 < k (the precise order of k_0 will be specified later), select randomly (without replacement) I_1, ..., I_{k_0} from 1, ..., k, and then apply the test in Sect. 3 to the samples X_{I_1}, ..., X_{I_{k_0}}. Let n_0 = n_{I_1} + ... + n_{I_{k_0}} be the total size of the selected data. Notice that when not all sample sizes are equal (unbalanced samples), n_0 is a random quantity. Let T_{k_0,n_0} = T_{k_0,n_0}(X_{I_1}, ..., X_{I_{k_0}}) denote the test statistic calculated on the selected samples. Then, H_0 is rejected if T_{k_0,n_0} ≥ t_{k_0,n_0,α}, where t_{k_0,n_0,α} is the α upper percentile of the null distribution of T_{k_0,n_0} for n_0 fixed (non-random) and equal to its observed value.
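The selection step can be sketched as follows, with β = 1 and the critical point t_{k_0,n_0,α} taken as given (e.g., obtained by simulation or from cv.quan); the helper name and interface are ours.

```python
import numpy as np

def random_selection_test(samples, k0, t_alpha, seed=None):
    """Randomly select k0 of the k samples (without replacement) and apply
    the pooled BHEP-type test (beta = 1) to the selected data only."""
    rng = np.random.default_rng(seed)
    idx = rng.choice(len(samples), size=k0, replace=False)
    y = np.concatenate([(np.asarray(samples[i], dtype=float) - np.mean(samples[i]))
                        / np.std(samples[i]) for i in idx])
    n0 = y.size                           # total size of the selected data
    d = y[:, None] - y[None, :]
    t = (np.exp(-d**2 / 2).sum() / n0**2
         - 2.0 / (n0 * np.sqrt(2.0)) * np.exp(-y**2 / 4).sum()
         + 1.0 / np.sqrt(3.0))
    return t, bool(t >= t_alpha)          # reject H0 iff statistic exceeds the critical point
```

Only the k_0 selected samples enter the O(n_0²) double sum, which is the source of the computational savings.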
To study properties of this procedure we first introduce some notation. For each possible value i of n_0, let a_0(i) = P_0(T_{k_0,n_0} ≥ t_{k_0,n_0,α} | n_0 = i). By construction and the definition of t_{k_0,n_0,α}, a_0(i) = α, ∀i, and thus the test has level α, because P_0(T_{k_0,n_0} ≥ t_{k_0,n_0,α}) = Σ_i a_0(i) P(n_0 = i) = α. Now we study the power.
Theorem 3 Suppose that, besides some assumptions on k_0, (12) holds and r/k → p ∈ (0, 1]. Then the power of the test that rejects H_0 when T_{k_0,n_0} ≥ t_{k_0,n_0,α} goes to 1, as m → ∞.
The result in Theorem 3 remains true if t_{k_0,n_0,α} is replaced with a consistent estimator.
The consistency result in Theorem 3 is very similar to that in Corollary 1, in the sense that, besides some assumptions on k_0, both tests are consistent under the same assumptions, namely, (12) and r/k → p ∈ (0, 1]. The comments in Remarks 1 and 2 also apply here. Notice that two different random selections of I_1, ..., I_{k_0} from 1, ..., k could lead to opposite conclusions. To avoid this inconvenience, we follow Cuesta-Albertos and Febrero-Bande (2010), who faced the same issue for tests based on random projections with functional data. These authors proposed to take several random projections, calculate the p-value for each projection, and then apply some correction, such as the procedure in Benjamini and Yekutieli (2001), which controls the false discovery rate. The same approach can be applied here.
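With several random selections, one p-value is obtained per selection, and the Benjamini-Yekutieli step-up procedure gives the global decision. A minimal sketch of that decision rule (the helper name is ours):

```python
import numpy as np

def by_global_reject(pvals, alpha=0.05):
    """Benjamini-Yekutieli step-up applied to the p-values of several random
    selections: reject the global H0 iff at least one p-value survives."""
    p = np.sort(np.asarray(pvals, dtype=float))
    s = p.size
    c = np.sum(1.0 / np.arange(1, s + 1))          # harmonic correction of BY
    thresholds = alpha * np.arange(1, s + 1) / (s * c)
    return bool(np.any(p <= thresholds))
```

The harmonic factor c makes the procedure valid under arbitrary dependence between the selections, which is relevant here because the selected subsets of samples may overlap.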

Sum of BHEP test statistics
As observed in Remark 3, we could consider test statistics based on the sum of the test statistics calculated on each sample. Here we study a test whose test statistic is of that type. Recall from Sect. 2.2 the definition of μ_{0,n} and τ²_{0,n} (the null mean and variance of nT_{n,1}), and that 0 < μ_{0,n} → μ_0 and 0 < τ²_{0,n} → σ²_0, as n → ∞. Consider the standardized sum

T_{0,k} = Σ_{i=1}^{k} (n_iT_i − μ_{0,n_i}) / (Σ_{i=1}^{k} τ²_{0,n_i})^{1/2},

where T_i is the BHEP test statistic (with β = 1) calculated on the sample from X_i. Under H_0, E(T_{0,k}) = 0 and V(T_{0,k}) = 1. Thus, it seems reasonable to reject H_0 for large values of T_{0,k}.
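A sketch of the combined statistic: we read T_{0,k} as the sum of the per-sample statistics n_iT_i, centered by μ_{0,n_i} and scaled by (Σ_i τ²_{0,n_i})^{1/2}. This standardization is our reading of the definition, the null moments are approximated here by (small) Monte Carlo runs, and all names are ours.

```python
import numpy as np

def nt_stat(x):
    """n * T_{n,1}: one-sample BHEP statistic with beta = 1."""
    x = np.asarray(x, dtype=float)
    n = x.size
    y = (x - x.mean()) / x.std()
    d = y[:, None] - y[None, :]
    return (np.exp(-d**2 / 2).sum() / n
            - np.sqrt(2.0) * np.exp(-y**2 / 4).sum()
            + n / np.sqrt(3.0))

def t0k(samples, reps=400, seed=0):
    """Standardized sum of per-sample BHEP statistics; under H0 it is
    approximately N(0, 1) for large k, so compare with z_{1-alpha}."""
    rng = np.random.default_rng(seed)
    num, var = 0.0, 0.0
    for x in samples:
        n = len(x)
        # Monte Carlo approximations of mu_{0,n} and tau^2_{0,n}
        null = np.array([nt_stat(rng.normal(size=n)) for _ in range(reps)])
        num += nt_stat(x) - null.mean()
        var += null.var()
    return num / np.sqrt(var)
```

In practice the null moments depend only on the sample size, so they can be computed once per distinct n_i and reused; the loop above recomputes them only for simplicity.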
To test H_0 we must calculate upper percentiles of the null distribution of T_{0,k}, which depends on n_1, ..., n_k. Although the critical points can be calculated by simulation, from a practical point of view it would be nice if they could be approximated in some fashion, at least for large k, since n_1, ..., n_k can take many values. The next result shows that, under H_0, T_{0,k} converges in law to a standard normal law, as k → ∞, no matter how large (or small) the sample sizes n_1, ..., n_k are; it is only assumed that n_i ≥ 3. If n_i = 2, then the two scaled residuals take the values −1 and 1 for any possible values of X_{i,1}, X_{i,2} (whenever they are different, which happens with probability 1 since X_i is assumed to be continuous), and thus T_i is a degenerate random variable.
Theorem 4 Suppose that (6) holds, that n_i ≥ 3, 1 ≤ i ≤ k, and that H_0 is true. Then T_{0,k} converges in law to the standard normal distribution, as k → ∞.

From Theorem 4, the test that rejects H_0 when T_{0,k} ≥ z_{1−α}, for some α ∈ (0, 1), where Φ(z_{1−α}) = 1 − α, has (asymptotic) level α; here asymptotic means for large k. Now we study the power of this test. As in the previous sections, we will suppose that (10) holds. The r alternative distributions will be assumed to satisfy the following assumption.
Under a fixed alternative with finite fourth moment, the suitably centered statistic √n_i(T_i − Δ_{X_i}) converges in law to a zero-mean normal distribution, where Δ_{X_i} denotes the a.s. limit of T_i. Thus, it makes sense to consider Assumption 1(a) and (b). Moreover, (5) (for each i) was seen to be a sufficient condition for Assumption 1(a) to hold. Since Δ_{X_i,n_i} is positive and close to Δ_{X_i}, Assumption 1(c) is saying that n_i must be large enough so that E(T_{0,k}) > 0. Table 1 below displays the values of μ_{0,n} and nΔ_{X,n} for some alternative distributions and some small values of n, calculated by simulation based on 100,000 samples in each case. The considered alternative distributions are:

• the beta distribution with parameters (2, 2) (b(2, 2)), whose pdf is f(x) = 6x(1 − x), x ∈ (0, 1),
• the Student t-distribution with ν degrees of freedom (t_ν), whose pdf is given by f(x) = Γ((ν + 1)/2)/(√(νπ) Γ(ν/2)) (1 + x²/ν)^{−(ν+1)/2}, where Γ is the gamma function,
• a scale mixture of two normal populations (SMN): pN(0, σ²) + (1 − p)N(0, 1), with p = 0.2 and σ = 3,
• the negative exponential distribution (exp), with pdf f(x) = exp(−x), x ∈ [0, ∞), and
• the chi-squared distribution with ν degrees of freedom (χ²_ν), whose pdf is given by f(x) = x^{ν/2−1} exp(−x/2)/(2^{ν/2} Γ(ν/2)), x ∈ (0, ∞).

Looking at Table 1 we see that Assumption 1(c) is not restrictive at all: it suffices to take n_i ≥ 5 for all considered alternatives.
To derive the asymptotic null distribution of T_{0,k}, no assumption was made on the sample sizes n_1, ..., n_k (except that all of them are greater than or equal to 3). To study the power, we will assume that the sample sizes are comparable, in the sense that they satisfy (2). Nevertheless, in contrast to the results for the power in the previous sections (see Corollary 1 and Theorem 3), here m is not assumed to increase; it only must be large enough so that Assumption 1(c) holds true.

Theorem 5 Suppose that (6), (10) and Assumption 1 hold, that the sample sizes satisfy (2), and that r/k → p ∈ (0, 1], as k → ∞. Then the power of the test that rejects H_0 when T_{0,k} ≥ z_{1−α} goes to 1, as k → ∞.

Simulation results
This section presents the results of several simulation experiments designed to study the finite sample performance of the three tests studied in this paper. We first study the goodness of the approximations given to the null distribution of the proposed test statistics, and then their powers, which are also compared with the following procedures for testing H_0:

• Compute the BHEP test for testing H_{0i}: X_i ∈ N, 1 ≤ i ≤ k. Then one can apply either the Bonferroni method, which controls the family-wise error rate, or the Benjamini-Hochberg method (see Benjamini and Yekutieli 2001), which controls the false discovery rate when the k tests are independent. Both procedures agree in rejecting H_0 if min_{1≤i≤k} p_i ≤ α/k, where p_1, ..., p_k are the p-values obtained when testing H_{01}, ..., H_{0k}, respectively. The results for this procedure are headed in the tables by BH (and we will also refer to it as the test BH).
• The tests in Gaigall (2021).
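The common rejection rule of the Bonferroni and Benjamini-Hochberg procedures for the global null reduces to comparing the smallest p-value with α/k; a one-line sketch (the helper name is ours):

```python
import numpy as np

def bh_global_reject(pvals, alpha=0.05):
    """Global-null decision shared by Bonferroni and Benjamini-Hochberg:
    reject H0 iff the smallest of the k per-sample p-values is <= alpha/k."""
    p = np.asarray(pvals, dtype=float)
    return bool(p.min() <= alpha / p.size)
```

The two procedures differ in which individual hypotheses H_{0i} they reject, but for the global H_0 both reduce to this minimum-p-value rule.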

Simulations for the level
We first consider the test in Sect. 3, which rejects H_0 when T_{k,n} > t_{k,n,α}. By construction, if one uses that critical region, then the test has exactly level α, so checking its actual level is of no interest. Recall that Theorem 1 states that, under certain assumptions, the asymptotic null distribution of T_{k,n} coincides with that of the one-sample test statistic T_{1,n}. Therefore, we could approximate the percentiles of T_{k,n} by those of T_{1,n}. Here we study that approximation by simulation, and hence we consider the test that rejects H_0 when T_{k,n} > t_{1,α}, where t_{1,α} is the α upper percentile of the null distribution of T_{1,n}. The results for this test are headed in the tables by T^{as}_{k,n} (and we will also refer to it as the test T^{as}_{k,n}). The critical point t_{k,n,α} can also be approximated by the α upper percentile of the asymptotic null distribution of T_{k,n}, say t_α. As explained at the end of Sect. 2.1, t_α can be estimated by using the Pearson system of distributions, as proposed in Henze (1990). The results for this test are headed in the tables by T^{Pe}_{k,n} (the test T^{Pe}_{k,n}). We also consider the test that rejects H_0 when T_{0,k} ≥ z_{1−α}, headed in the tables by T_{0,k} (the test T_{0,k}).
The test statistics in Gaigall (2021) are sums of the Kolmogorov-Smirnov and the Cramér-von Mises test statistics over the samples, and so it is expected that, conveniently normalized (subtracting the mean and dividing by the square root of the variance, as we did in Sect. 5 to obtain T_{0,k}), those statistics are asymptotically normal as k → ∞. Notice that the null distribution of the Kolmogorov-Smirnov and the Cramér-von Mises test statistics in each sample does not depend on the values of the population mean and variance, but only on the sample size. We calculated by simulation the means and the variances of these statistics in a sample, and considered the test that rejects H_0 when KS ≥ z_{1−α} and the test that rejects H_0 when CM ≥ z_{1−α}, where KS and CM are the Kolmogorov-Smirnov and the Cramér-von Mises analogues of T_{0,k}, respectively. The results for these tests are headed in the tables by KS and CM (the tests KS and CM), respectively.
In each case we did the following experiment: k random samples with sizes n_i = m, 1 ≤ i ≤ k, were generated from a standard normal law, and the tests KS, CM, T^{as}_{k,n}, T^{Pe}_{k,n}, T_{0,k} and BH were applied with α = 0.05. The experiment was repeated 10,000 times (all simulations for the level in this paper are based on 10,000 samples). Table 2 displays the proportion of times that H_0 was rejected for k = 2, 3, 5, 10, 20 and n_i = 5, 10, 15, ..., 45. Looking at Table 2 we conclude that: (a) for the test T^{as}_{k,n}: as expected, the approximation works when k is small in relation to n_i; the distortion of the level may be important if such a relation is not met; (b) the same can be concluded for the test T^{Pe}_{k,n}, whose performance is quite close to that of T^{as}_{k,n}; (c) for the test T_{0,k}: the approximation works better for larger values of k (as expected); nevertheless, the empirical levels are not far from the nominal level even for k = 2, and its behavior does not seem to be influenced by the sample sizes; (d) the same can be concluded for the tests KS and CM; (e) for the test BH: the levels are reasonably close to the nominal level.
The above experiment was repeated for larger values of k (specifically, for k = 100, 200) and n_i = 5, 10, 15, 20, but instead of the test T^{as}_{k,n} (and T^{Pe}_{k,n}) we considered the random selection test studied in Sect. 4 with k_0 = 10, 20, headed in the tables by RP (the test RP). We tried that test with one and with more than one random selection. In view of the results of Table 2 for the test T^{as}_{k,n} (and T^{Pe}_{k,n}), and since the sample sizes considered are not very large compared to k_0, we used the exact critical values of the null distribution of T_{k_0,n_0}; when several random selections are taken into account (5, 10, 20 and 30), we proceed as explained before for the test BH. The results obtained are displayed in Table 3. In all cases, the empirical levels match the nominal value quite closely.

Simulations for the power
As said at the beginning of Sect. 3, in our developments we chose β = 1, although all results remain true for arbitrary (but fixed) β. It is well known that, for finite sample sizes, the power of the BHEP test strongly depends on the value of the parameter β and on the alternative. Tenreiro (2009) observed that for short-tailed alternatives high power is obtained for large (but not too large) values of β, and that for long-tailed (symmetric or asymmetric) alternatives a small value of β should be chosen. Since the tests studied in this paper are all based on the BHEP test, it is expected that they inherit these characteristics. So, in the power study, we tried the proposed tests for several values of β. As in Ebner and Henze (2021), we considered β ∈ {0.25, 0.5, 0.75, 1, 2, 3, 5, 10}.
To examine the power we repeated the experiments in the previous section, but now r samples were taken from an alternative distribution and the other k − r samples were generated from a standard normal distribution. The value of r is chosen so that the percentage of alternative distributions equals 20%, 40%, 60% and 80%. All simulations for the power in this paper are based on 2,000 samples. For the test that rejects H_0 when T_{k,n} > t_{k,n,α} (the dependence on β is skipped to simplify notation), we calculated the exact critical values (headed in the tables by T^{ex}_{k,n}; the test T^{ex}_{k,n}). The columns headed RP1 and RP2 display the results for the random selection method with k_0 = 10 and k_0 = 20, respectively, both with 30 random selections because, in most cases, that number of selections gave the highest power (among 1, 3, 5, 10, 20 and 30 random selections).
As for the alternatives, we considered several short-tailed and some long-tailed distributions. The picture in each case is very similar and agrees with the observations of Tenreiro (2009). Due to space limitations, next we just summarize the experimental results. All tables and a detailed description of the numerical results can be found in the Supplementary Material.
Looking at the tables for the power in the Supplementary Material, in general it can be concluded that: (a) the power of all tests increases with the percentage of alternative distributions, with the sample sizes and with k; (b) the test BH gives the poorest results; (c) no test gives the highest power for all alternatives; (d) among the BHEP-based tests, again no single test gives the highest power for all alternatives; nevertheless, T_{0,k} gives powers (for adequate choices of β) that are either optimal or reasonably close to optimal; (e) KS is less powerful than CM; their powers are smaller than those of T^ex_{k,n} and T_{0,k} (for adequate choices of β) for small k, and of RP2 and T_{0,k} (for adequate choices of β) for larger k.
From a computational point of view, among the BHEP-based tests, T_{0,k} is the best choice: the number of computations required for its test statistic is of order O(km²), and its application does not involve the calculation of critical points.
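The O(km²) cost can be sketched as follows: one BHEP statistic per sample (O(m²) each), summed over the k samples, then centred and scaled. This is only an illustrative sketch: the paper centres with the exact null moments μ_{0,n} and τ²_{0,n}, whereas here they are estimated by Monte Carlo under H_0, and the names `bhep` and `t0k_statistic` are our own.

```python
import numpy as np

def bhep(x, beta=1.0):
    """Closed-form BHEP statistic for a univariate sample (O(m^2))."""
    x = np.asarray(x, dtype=float)
    n = x.size
    y = (x - x.mean()) / x.std()
    b2 = beta ** 2
    d = np.subtract.outer(y, y) ** 2
    return (np.exp(-b2 * d / 2.0).sum() / n
            - 2.0 / np.sqrt(1.0 + b2)
              * np.exp(-b2 * y ** 2 / (2.0 * (1.0 + b2))).sum()
            + n / np.sqrt(1.0 + 2.0 * b2))

def t0k_statistic(samples, beta=1.0, n_mc=200, rng=None):
    """Centred and scaled sum of per-sample BHEP statistics.

    Assumption of this sketch: the null mean and variance of each
    per-sample statistic are estimated by n_mc Monte Carlo replicates
    instead of using the exact moments mu_{0,n} and tau^2_{0,n}.
    Under H0 the returned value is approximately N(0, 1).
    """
    rng = np.random.default_rng() if rng is None else rng
    total, var = 0.0, 0.0
    for x in samples:
        n = len(x)
        null = np.array([bhep(rng.normal(size=n), beta) for _ in range(n_mc)])
        total += bhep(x, beta) - null.mean()
        var += null.var()
    return total / np.sqrt(var)
```

In line with the text, no critical-point table is needed: the test simply compares the returned value with a standard normal quantile.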

Table 2 Empirical levels for small to moderate values of k at the nominal level

Further simulation results
In the above experiments all samples have the same size n_1 = n_2 = . . . = n_k := m, so the data can also be viewed as m independent observations of a k-dimensional random vector Y whose jth component corresponds to the jth population. An anonymous referee asked us to apply the BHEP test to the Y-data. With this aim we need m ≥ k + 1, so we only considered the cases k = 10 with n_i = 15, 20, 25, 30 and k = 20 with n_i = 25, 30. Tenreiro (2009) performed an extensive simulation study on the power of the BHEP test for a wide range of data dimensions. He recommends using β_k = √2/(1.376 + 0.075k). We repeated the power simulation study for the Y-data using β_k and the critical points of the BHEP test for each value of k (which now becomes the dimension) and m (the sample size). The results are displayed in the Supplementary Material. Comparing them with those yielded by the tests proposed in this paper, CM and KS, applied to the X-data, we see that the power of the BHEP test applied to the Y-data is really poor.
The above Y-data description assumes that the components of Y are independent. To numerically study the effect on the level of the tests considered in Sect. 6.1 when the independence assumption is dropped, the following simulation experiment was carried out: we generated data from Y ∼ N_k(0, Σ_ρ), where Σ_ρ is the equicorrelation matrix, and then applied the tests in Table 2 to the associated X-data. The results are displayed in the Supplementary Material. They show that, as the dependence between the components of Y becomes stronger, the empirical levels move further away from the nominal value 0.05. As a consequence, the case of Y-data with correlated components requires the development of new procedures that take such dependence into account.
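The data-generating step of this experiment can be sketched as follows; the helper name `equicorrelated_samples` is our own, and the mapping from Y-data to X-data (sample j = component j of Y across the m draws) matches the description above.

```python
import numpy as np

def equicorrelated_samples(k, m, rho, rng=None):
    """Draw m observations of Y ~ N_k(0, Sigma_rho) and return the
    associated X-data: k samples of size m, sample j being the jth
    component of Y.  Sigma_rho is the equicorrelation matrix
    (1 on the diagonal, rho off it)."""
    rng = np.random.default_rng() if rng is None else rng
    sigma = np.full((k, k), rho) + (1.0 - rho) * np.eye(k)
    y = rng.multivariate_normal(np.zeros(k), sigma, size=m)  # shape (m, k)
    return [y[:, j] for j in range(k)]
```

Note that each marginal sample is still exactly standard normal, so any single-sample normality test keeps its level; it is the dependence *across* the k samples that distorts the joint level, consistent with the empirical findings reported above.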

Concluding remarks and further research
This paper proposes and studies three procedures for simultaneously testing that k independent samples come from normal populations, which may have different means and variances. All of them are based on the BHEP test and allow k to increase. The first test, based on the statistic T_{k,n}, can be seen as a direct extension of the BHEP test. Its null distribution depends only on k and the sample sizes, so exact critical points can be calculated by simulation; when the sample sizes are large relative to k, one can use the critical points of the BHEP test. If k is very large, the practical calculation of T_{k,n} is very time-consuming, so one can randomly select k_0 samples (one or more times) and apply the previous test to the selected samples. One can also calculate the BHEP test statistic in each sample and then sum the obtained values; conveniently centred and scaled, this sum (T_{0,k}) converges in law to a standard normal law when the null hypothesis is true. The normal approximation works reasonably well even for moderate k, so its practical use does not require the calculation of critical points. All tests are consistent against alternatives where the fraction of samples not obeying the null tends to a positive constant. From a computational point of view, the test based on T_{0,k} is the best choice.
This paper has centred on BHEP-based procedures for simultaneously testing that k independent samples come from normal populations. Other normality tests could be used to build procedures similar to those developed in Sects. 3-5. Moreover, parallel approaches could be used for simultaneously testing that k independent samples come from any location-scale family.
As observed in Remark 2, in certain specific settings the Bahadur efficiency calculations made in Ebner and Henze (2021) and Meintanis et al. (2022) for the BHEP test also apply to the tests proposed in Sects. 3 and 4. It would be interesting to study Bahadur efficiencies in more general settings, and also for the test in Sect. 5. Those calculations could help to determine optimal values of β and k_0.
Finally, in simulations we saw that the tests studied in this paper are not valid for dependent data.New procedures that take into account such dependence should be developed.

Proofs
Throughout this section, M denotes a generic positive constant that may take different values at different appearances.

Auxiliary results
Lemma 1 Suppose that (6) holds and that H_0 is true. Then [...], where Γ stands for the gamma function. Using the Legendre duplication formula (see display 5.5.5 of Olver et al. 2010) and Stirling's formula (see display 5.11.3 of Olver et al. 2010), one gets that, for large n_j, [...]. Since √(n_j − 1) X̄_j/S_j ∼ t_{n_j−1}, one gets that E(X̄_j²/S_j²) = 1/(n_j − 3). We also have that [...] and, for n_j ≥ 6, [...]. (b.2) The proof is similar to that of (b.1). (b.3, b.4) The proof is similar to that of (b.1), using Lemma 1 (b) and (c). (b.5) We have that E{W_n(t)²} = t² φ_0(t)² (1/n) Σ_{j,v=1}^k E_{jv}, where [...]. Finally, the assertion follows by applying the continuous mapping theorem.
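The second-moment identity used above follows in one line from the stated t distribution and the standard second moment of a t_ν variable, E(t_ν²) = ν/(ν − 2) for ν > 2:

```latex
\sqrt{n_j-1}\,\frac{\bar X_j}{S_j}\sim t_{n_j-1}
\;\Longrightarrow\;
E\!\left(\frac{\bar X_j^2}{S_j^2}\right)
=\frac{1}{n_j-1}\,E\!\left(t_{n_j-1}^2\right)
=\frac{1}{n_j-1}\cdot\frac{n_j-1}{n_j-3}
=\frac{1}{n_j-3}.
```

The restriction n_j ≥ 6 appearing in the proof is natural here, since the t moments involved require the degrees of freedom to exceed the order of the moment.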

Proof of Theorem 2
By a Taylor expansion, proceeding as in the proof of Proposition 1, we can write [...], where Z_{k,n} is as defined in (7) and [...]. The reasoning in the proof of Lemma 3 can be used to prove (now using the SLLN in Hilbert spaces; see Theorem 2.4 in Bosq 2000) that W_{n,1} converges almost surely [...].

Proof of Theorem 3
We have that the power [...], where a(i) is as defined in (15). If, for some j, I_j is empty, then we define the corresponding term to be 0. Suppose that I_j is not empty.
Proof of Theorem 4

T_{0,k} is a sum of independent zero-mean random variables. Thus, it suffices to check that the Lindeberg condition below is met, [...], where 1(·) stands for the indicator function. From Proposition 1 it follows that Σ_{i=1}^k τ²_{0,n_i} ≥ kτ_0, ∀k, for some τ_0 > 0. As a consequence, we have that [...]. From Proposition 1 it also follows that lim_{k→∞} sup_n μ(n, k, ε) = 0. This fact implies that H(ε) → 0 as k → ∞, and thus the result is proven.
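For reference, the classical Lindeberg condition invoked here can be sketched as follows; the labels Y_i and s_k are our own, with Y_i denoting the i-th centred per-sample statistic and s_k² = Σ_{i=1}^k τ²_{0,n_i}:

```latex
\frac{1}{s_k^2}\sum_{i=1}^{k}
E\!\left[Y_i^2\,\mathbf{1}\{|Y_i|>\varepsilon s_k\}\right]
\longrightarrow 0 \quad \forall\,\varepsilon>0
\;\Longrightarrow\;
\frac{1}{s_k}\sum_{i=1}^{k} Y_i \xrightarrow{\;d\;} N(0,1).
```

The bound Σ_{i=1}^k τ²_{0,n_i} ≥ kτ_0 guarantees that s_k² grows at least linearly in k, which is what drives the truncated expectations above to zero.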

Proof of Theorem 5
We first prove that [...], where [...]. With this aim, we prove that the Lindeberg condition below is met: [...] → 0, ∀ε > 0, as k → ∞.

Fig. 1 True values of μ_{0,n} (left panel) and τ²_{0,n} (right panel), 5 ≤ n ≤ 100. The line y = μ_0 (left panel) and the line y = σ²_0 (right panel) are in red

Σ_{u: u ≤ p_0 k_0} [...] + Σ_{u: u > p_0 k_0} [...] := Σ_{≤ p_0} + Σ_{> p_0}. Let H ∼ H(k, r, k_0), where H(k, r, k_0) stands for a hypergeometric distribution: from a population with k units, r of type A and k − r of type B, a sample of size k_0 is selected without replacement, and H is the number of type-A units in the sample. Let I(p_0) = {i ∈ I_u, u > p_0 k_0} and N_{p_0} = card{I(p_0)}. We can write Σ_{> p_0} = Σ_{u: u > p_0 k_0} P(H = u) + (1/C(k, k_0)) Σ_{i∈I(p_0)} (a(i) − 1) := W_1 + W_2.
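The hypergeometric quantities appearing in this part of the proof are easy to evaluate exactly; the following sketch (function names are our own) computes P(H = u) and the tail probability P(H > p_0 k_0), which corresponds to the term W_1 above.

```python
from math import comb

def hyper_pmf(k, r, k0, u):
    """P(H = u) for H ~ H(k, r, k0): u type-A units in a sample of size
    k0 drawn without replacement from k units, r of which are type A."""
    if u < 0 or u > min(r, k0) or k0 - u > k - r:
        return 0.0
    return comb(r, u) * comb(k - r, k0 - u) / comb(k, k0)

def tail_above(k, r, k0, p0):
    """P(H > p0 * k0): probability that the random selection contains
    more than a fraction p0 of the r 'alternative' samples."""
    return sum(hyper_pmf(k, r, k0, u) for u in range(k0 + 1) if u > p0 * k0)
```

For instance, with few alternative samples (small r relative to k) this tail probability is tiny, while it approaches one when most samples obey the alternative, which is the mechanism the proof exploits when splitting Σ_{> p_0}.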

Table 1 Values of μ_{0,n} and [...], for 3 ≤ n ≤ 10 and some alternative distributions for X