1 Introduction

In many psychological, biological and medical experiments, data are collected in a matched pairs design, e.g. when a homogeneous group of subjects is repeatedly observed under two conditions, called time points in the terminology of repeated measures designs. In such designs, different variances of the observations occur in a natural way, e.g. when data are collected over time. The data of such trials can be modeled by independent and identically distributed random vectors

\[
\mathbf{X}_i = (X_{i,1}, X_{i,2})', \quad i=1,\ldots,n, \tag{1.1}
\]

with expectation \(E(\mathbf{X}_1)=\mu=(\mu_1,\mu_2)'\) and an arbitrary positive definite covariance matrix \(\operatorname{Var}(\mathbf{X}_1)=\Sigma\). Our aim is to test the null hypothesis \(H_0:\mu_1=\mu_2\), or \(H_{0}^{(1)}:\mu_{1}\leq\mu_{2}\), in this semi-parametric framework.

The paired t-test type statistic \(|T_{n,stud}|\) with

\[
T_{n,stud} = \sqrt{n}\,\frac{\overline{D}_n}{V_n} \tag{1.2}
\]

is the commonly used statistic for testing \(H_0\), where \(D_i = X_{i,1}-X_{i,2}\) denote the differences of the pairs for \(i=1,\ldots,n\), \(\overline{D}_{n}=n^{-1}\sum_{i=1}^{n} D_{i} = \overline{X}_{1} - \overline{X}_{2}\) is the difference of the means, and \(V_{n}^{2}=(n-1)^{-1}\sum_{i=1}^{n} (D_{i}-\overline{D}_{n})^{2}\) denotes the sample variance of the \(D_i\)'s. As is commonly known, \(T_{n,stud}\) is exactly \(T(n-1)\)-distributed under \(H_0\) if the differences are normal, even for arbitrary \(\Sigma\). Under non-normality, \(T_{n,stud}\) is asymptotically standard normal by the central limit theorem, and its distribution may be approximated by a \(T(n-1)\)-distribution. For large sample sizes, the null hypothesis \(H_0:\mu_1=\mu_2\) will be rejected if \(|T_{n,stud}|\geq t_{1-\alpha/2}\), where \(t_{1-\alpha/2}\) denotes the \((1-\alpha/2)\)-quantile of the \(T(n-1)\)-distribution. Thus, the t-test can be equivalently written as

\[
\varphi_t = \mathbf{1}\{|T_{n,stud}| \geq t_{1-\alpha/2}\}. \tag{1.3}
\]

For testing \(H_{0}^{(1)}\), the t-test \(\varphi_t\) can be redefined by using \(T_{n,stud}\) as the test statistic in (1.3) and replacing the critical value \(t_{1-\alpha/2}\) by \(t_{1-\alpha}\). In a variety of papers and applications, however, it has been shown that the convergence of \(T_{n,stud}\) to its normal limit is rather slow, particularly for skewed distributions of the differences. For a detailed explanation we refer the reader to Munzel (1999).
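For illustration, the paired t-test in (1.2) and (1.3) can be sketched as follows (a minimal sketch; the function name and the toy data are our own, and the critical value is tabulated for this fixed \(n=7\)):

```python
import math

def paired_t_stat(x1, x2):
    """T_{n,stud} = sqrt(n) * mean(D) / V_n, cf. (1.2)."""
    d = [a - b for a, b in zip(x1, x2)]            # differences D_i
    n = len(d)
    dbar = sum(d) / n                              # mean difference
    v2 = sum((di - dbar) ** 2 for di in d) / (n - 1)   # sample variance V_n^2
    return math.sqrt(n) * dbar / math.sqrt(v2)

# hypothetical paired measurements, n = 7
x1 = [5.1, 4.8, 6.0, 5.5, 4.9, 5.7, 5.2]
x2 = [4.9, 4.7, 5.8, 5.6, 4.8, 5.5, 5.0]

t_stat = paired_t_stat(x1, x2)
t_crit = 2.447                       # tabulated t_{0.975} quantile, T(6)-distribution
reject = abs(t_stat) >= t_crit       # the test phi_t in (1.3)
```

For normal differences this is the exact level \(\alpha\) test; otherwise it relies on the large-sample approximation discussed above.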

It is the aim of the present paper to discuss the limit behaviour of various resampling versions of \(T_{n,stud}\) in order to improve its small sample properties under non-normality. Specific examples are various kinds of bootstrap and permutation resampling statistics. Although the data may not be exchangeable in model (1.1), an accurate and (asymptotically) valid level \(\alpha\) resampling test for \(H_0\) can be derived if (i) the resampling distribution of the statistic is asymptotically independent of the distribution of the data; (ii) the resampling distribution has a limit; and (iii) the distribution of the test statistic and the conditional resampling distribution (asymptotically) coincide, see Janssen (1997, 1999a, 1999b, 2005), Janssen and Pauls (2003, 2005), Neubert and Brunner (2007), Pauly (2011) or Omelka and Pauly (2012). The items (i)–(iii) will be referred to as the permanence property of resampling tests.

More details on theory and applications of bootstrap and permutation tests can be found in the monographs of Basso et al. (2009), Good (2005) as well as Pesarin and Salmaso (2010b). Moreover, when comparing more than one aspect of the data, Brombin et al. (2011) also discuss permutation tests for paired observations with a useful application. In particular, permutation approaches for multivariate data are intensively discussed by Pesarin and Salmaso (2012) and Brombin and Salmaso (2009). Both papers provide a detailed summary of existing procedures and some new developments. Regarding repeated measures designs, Pesarin and Salmaso (2010a) apply permutation tests and investigate their finite-sample properties.

The intuitive resampling and permutation strategies are to draw the differences \(D_i\) with replacement from the data, or to permute the variables \(X_{i,1}\) and \(X_{i,2}\) within the pairs, respectively. The drawback of both resampling schemes is that only a few permutations (\(2^n\)) are available, or that only a small variety within the resamples occurs, when \(n\) is rather small. The counterintuitive resampling or permutation strategies are either drawing the variables \(X_{i,s}^{\ast}\) with replacement from all \(2n\) observations \(X_{1,1},\ldots,X_{n,2}\), drawing them with replacement from each centered marginal sample \(X_{1,s}-\overline{X}_{s},\ldots, X_{n,s}-\overline{X}_{s}\), \(s=1,2\), separately, or permuting all \(2n\) observations in \(\mathbf{X}=(X_{1,1},X_{1,2},\ldots,X_{n,2})'\), and then repeatedly computing (e.g. 10,000 times) the paired t-test statistic. On the one hand, these counterintuitive resampling methods increase the resampling variability; on the other hand, the dependency structure within the pairs is neglected. In this paper, it will be shown that the intuitive as well as the counterintuitive resampling strategies, which neglect the dependency structure in the data, fulfill the permanence property, and thus the corresponding resampling tests are asymptotically valid. Extensive simulation studies show that especially the permutation-based approaches improve the paired t-test, even for extremely small sample sizes. The paper is organized as follows: In Sect. 2 we explain how resampling and permutation tests work and explain in detail why the resulting tests are asymptotically valid. In Sect. 3 extensive simulations are conducted to compare the different resampling schemes with the paired t-test. The paper closes with a discussion of the results. All technical details and proofs are given in the Appendix.

2 How do paired bootstrap and permutation tests work?

In this section we study various resampling versions of the paired t-test. Among other things, we point out why certain bootstrap and permutation tests, which neglect the dependency structure of the data within their resampling scheme, are asymptotically valid level \(\alpha\) tests for \(H_0\). Let \(\mathbf{X}^{\ast}=(\mathbf{X}_{1}^{\ast}, \ldots, \mathbf{X}_{n}^{\ast})'\), with \(\mathbf{X}_{i}^{\ast}=(X_{i,1}^{\ast},X_{i,2}^{\ast})\), denote \(n\) resampling vectors for \(i=1,\ldots,n\), given the original data \(\mathbf{X}\), where

  1. (I)

    \(\mathbf{X}^{\ast}\) is a random permutation of all data \(\mathbf{X}=(X_{1,1},X_{1,2},\ldots,X_{n,2})'\), or

  2. (II)

    \(\mathbf{X}_{i}^{\ast}\) is a random permutation of the sample unit \(\mathbf{X}_{i}'=(X_{i,1},X_{i,2})\), or

  3. (III)

    \(X_{i,s}^{\ast}\) is randomly drawn with replacement from all data X, or

  4. (IV)

    \(X_{i,s}^{\ast}\) is randomly drawn with replacement from each centered marginal sample \(\mathbf{X}_{s}=(X_{1,s}-\overline{X}_{s},\ldots, X_{n,s}-\overline{X}_{s})',s=1,2\), respectively.

The conditional resampling statistic of T n,stud is then given by

\[
T_{n,stud}^{\ast} = \sqrt{n}\,\frac{\overline{D}_{n}^{\ast}}{V_{n}^{\ast}}, \tag{2.1}
\]

where \(D_{i}^{\ast}= X_{i,1}^{\ast}- X_{i,2}^{\ast}\) denotes the differences of the resampling variables for i=1,…,n, \(\overline{D}_{n}^{\ast}= n^{-1}\sum_{i=1}^{n} D_{i}^{\ast}\) denotes their mean, and \(V_{n}^{\ast 2}=(n-1)^{-1}\sum_{i=1}^{n}(D_{i}^{\ast}-\overline{D}_{n}^{\ast})^{2}\) denotes the sample variance of the differences \(D_{i}^{\ast}\).
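The resampling schemes (I)–(IV) and the statistic (2.1) can be sketched as follows (a sketch with our own helper names, not the authors' implementation; `x` holds the pairs row-wise):

```python
import numpy as np

rng = np.random.default_rng(1)

def t_stud(d):
    """Studentized statistic sqrt(n) * mean / sd, cf. (1.2) and (2.1)."""
    n = d.size
    return np.sqrt(n) * d.mean() / d.std(ddof=1)

def resample_pairs(x, scheme, rng):
    """One resampling draw X* under schemes (I)-(IV); x has shape (n, 2)."""
    n = x.shape[0]
    if scheme == "I":        # permute all 2n observations
        return rng.permutation(x.ravel()).reshape(n, 2)
    if scheme == "II":       # permute within each pair (random swap)
        swap = rng.integers(0, 2, size=n).astype(bool)
        out = x.copy()
        out[swap] = out[swap][:, ::-1]
        return out
    if scheme == "III":      # draw with replacement from all 2n observations
        return rng.choice(x.ravel(), size=(n, 2), replace=True)
    if scheme == "IV":       # draw from each centered marginal sample
        cent = x - x.mean(axis=0)
        return np.column_stack([rng.choice(cent[:, s], size=n, replace=True)
                                for s in (0, 1)])
    raise ValueError(scheme)

x = rng.normal(size=(10, 2))                       # toy data, n = 10
schemes = ("I", "II", "III", "IV")
stats = {s: [t_stud(np.subtract(*resample_pairs(x, s, rng).T))
             for _ in range(200)] for s in schemes}   # T*_{n,stud} values
```

The empirical distribution of each `stats[s]` plays the role of the conditional resampling distribution used below.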

We point out that the studentization in the denominator of (2.1) is part of the resampling procedure, which is in accordance with the guidelines for bootstrap testing, see Hall and Wilson (1991), Beran (1997), Bickel and Freedman (1981), and Janssen (2005). Delaigle et al. (2011) have further shown that studentized resampling t-statistics are more robust and accurate than non-studentized statistics. The following explains how the corresponding resampling tests can be computed.

The introduced conditional resampling tests rely on a reference distribution \(\mathcal{L}(T_{n,stud}^{\ast}|\mathbf{X})\) given the data X. This means that the data are treated as fixed values, and quantiles from the conditional resampling distribution of \(T_{n,stud}^{\ast}\) are estimated to compute critical values. Denote by \(c_{n}^{\ast}(1-\alpha)\) the (1−α)-quantile of \(\mathcal {L}(T_{n,stud}^{\ast}|\mathbf{X})\). Then, according to the definition of the paired t-test in (1.3), conditional resampling tests can be written as

\[
\varphi_{n}^{\ast} = \mathbf{1}\{|T_{n,stud}| \geq c_{n}^{\ast}(1-\alpha/2)\}. \tag{2.2}
\]

Next we prove that \(T_{n,stud}^{\ast}\) as given in (2.1) is asymptotically standard normal under all of the different resampling schemes described above. In particular, we show that the permanence property is fulfilled and thus that \(\varphi_{n}^{\ast}\) is an asymptotically valid test for \(H_0\). The asymptotic normality is derived under arbitrary alternatives, i.e. we do not assume that \(H_0\) is true. To answer the question “How do paired bootstrap and permutation tests work?” we introduce the following criterion from Janssen and Pauly (2010), which uses the paired t-test as a benchmark for the resampling procedures.

Definition 2.1

The conditional tests \(\varphi_{n}^{*}\) defined in (2.2) are called

  1. (i)

    asymptotically effective under \(H_0\) with respect to the paired t-test, iff

    \[
    E_{H_0}\bigl(\bigl|\varphi_{n}^{\ast}-\varphi_{t}\bigr|\bigr) \longrightarrow 0 \quad \text{as } n\to\infty; \tag{2.3}
    \]
  2. (ii)

    consistent iff

    \[
    E\bigl(\varphi_{n}^{\ast}\bigr) \longrightarrow 1 \tag{2.4}
    \]

    for \(\mu_1 \neq \mu_2\) as \(n\to\infty\).

Now we can formulate

Theorem 2.1

The resampling tests \(\varphi_{n}^{*}\) defined in (2.2), based on the resampling statistic in (2.1), are asymptotically effective with respect to \(\varphi_t\) and consistent under all resampling schemes (I) through (IV).

From the proof it can be seen that a similar result also holds for one-sided versions of the tests. For further details see the Appendix. Specifically, Theorem 2.1 shows that the counterintuitive resampling procedures (I), (III) and (IV) are asymptotically valid, because studentized statistics are resampled. Roughly speaking, the studentization of the resampling variables “deletes” the dependency structure in the data when n is sufficiently large.

2.1 Resampling the differences \(D_i\)

In this subsection we introduce further resampling methods, particularly wild bootstrap methods, which are based on the differences \(D_i\). The wild bootstrap technique is motivated by the residual bootstrap commonly applied in regression analysis, see Wu (1986), Mammen (1992) and Beran (1997), and in time-series testing problems, see Kreiss and Paparoditis (2011). It has also been proposed in the context of survival analysis, see Lin (1997) or Beyersmann et al. (2012). Here, we adapt the wild bootstrap to the simple matched pairs design, and we will compare the accuracy of the resulting test procedures with the resampling tests based on (2.1) in extensive simulation studies. Let \(D_{1}^{\ast},\ldots,D_{n}^{\ast}\) denote \(n\) resampling variables given the original differences \(\mathbf{D}=(D_{1},\ldots,D_{n})'\), where \(D_{i}^{\ast}\) denotes the observed value from

  1. (V)

    drawing with replacement from all differences D, or

  2. (VI)

    from a wild bootstrap method with \(D_{i}^{\ast}= W_{i}D_{i}\), where \(W_i\), \(i=1,\ldots,n\), denote independent and identically distributed random variables, which are independent of the \(D_i\)'s, with \(E(W_1)=0\) and \(\operatorname{Var}(W_1)=1\).

The corresponding resampling tests are then defined as in (2.2) with the paired t-test type resampling statistic

\[
T_{n,stud}^{\ast} = \sqrt{n}\,\frac{\overline{D}_{n}^{\ast}}{V_{n}^{\ast}}, \tag{2.5}
\]

where now \(\overline{D}_{n}^{\ast}= n^{-1}\sum_{i=1}^{n} D_{i}^{\ast}\) denotes the mean of the resampled differences, and \(V_{n}^{\ast2} = (n-1)^{-1}\sum_{i=1}^{n}(D_{i}^{\ast}-\overline{D}_{n}^{\ast})^{2}\) denotes the sample variance of the \(D_{i}^{\ast}\)'s. The effectiveness of these resampling procedures is established in the next theorem.

Theorem 2.2

The resampling tests \(\varphi_{n}^{*}\) defined in (2.2) with the resampling statistic in (2.5) are asymptotically effective with respect to \(\varphi_t\) and consistent under both resampling schemes (V) and (VI).

Example and Remark 2.1

In our simulation study in Sect. 3, we will focus on the following weight examples. However, there are of course others that may be of interest for particular situations.

  1. (a)

    \(W_i\), \(i=1,\ldots,n\), is a sequence of independent and identically distributed two-point random variables with

    \[
    P\biggl(W_{1} = \frac{1+\sqrt{5}}{2}\biggr) = \frac{\sqrt{5}-1}{2\sqrt{5}}, \qquad P\biggl(W_{1} = \frac{1-\sqrt{5}}{2}\biggr) = \frac{\sqrt{5}+1}{2\sqrt{5}}.
    \]

    In this case it even holds that \(E(W_{1}^{3})=1\), in addition to \(E(W_1)=0\) and \(\operatorname{Var}(W_1)=1\). These wild bootstrap weights are typically used for studentized test statistics, see e.g. Kreiss and Paparoditis (2011). We will call the corresponding test Rademacher wild bootstrap.

  2. (b)

    \(W_i\), \(i=1,\ldots,n\), is a sequence of independent and identically distributed Gaussian random variables, i.e. \(W_i \sim N(0,1)\). This corresponds to the resampling procedure proposed by Lin (1997).

We note that Arlot et al. (2010a, 2010b) investigate wild bootstrap methods for multiple comparisons and confidence intervals in high-dimensional data using random signs W i ,i=1,…,n, with distribution P(W 1=−1)=P(W 1=1)=1/2. This resampling method, however, is equivalent to the resampling scheme (II). For further details we refer the reader to Janssen (1999b).
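Schemes (V) and (VI) can be sketched analogously (a sketch with our own names; the Gaussian weights of Remark 2.1(b) are used for the wild bootstrap, and the toy differences are simulated skewed data):

```python
import numpy as np

rng = np.random.default_rng(7)

def t_stud(d):
    """Studentized statistic sqrt(n) * mean / sd, cf. (2.5)."""
    n = d.size
    return np.sqrt(n) * d.mean() / d.std(ddof=1)

def resample_diffs(d, scheme, rng):
    """One draw D* under scheme (V) or (VI)."""
    n = d.size
    if scheme == "V":                 # draw differences with replacement
        return rng.choice(d, size=n, replace=True)
    if scheme == "VI":                # wild bootstrap: D*_i = W_i * D_i
        w = rng.standard_normal(n)    # Gaussian weights, E(W)=0, Var(W)=1
        return w * d
    raise ValueError(scheme)

d = rng.exponential(size=12) - 1.0    # skewed toy differences, n = 12
t_star_V = [t_stud(resample_diffs(d, "V", rng)) for _ in range(500)]
t_star_VI = [t_stud(resample_diffs(d, "VI", rng)) for _ in range(500)]
```

The empirical distributions of `t_star_V` and `t_star_VI` again serve as the conditional reference distributions for the tests in (2.2).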

Theorems 2.1 and 2.2 state that all the considered procedures fulfill the permanence property; thus, the corresponding tests \(\varphi_{n}^{\ast}\) are asymptotically valid. The numerical algorithm for the computation of the p-value is as follows:

  1. (1)

    Given the data X, compute the paired t-test statistic T n,stud as given in (1.2).

  2. (2)

    Repeat the resampling steps \(N\) times (e.g. \(N=10{,}000\)), compute the values \(T_{n,stud}^{\ast}\) and save them in \(A_{1},\ldots,A_{N}\).

  3. (3)

    Estimate the two-sided p-value by

    \[
    \hat{p} = 2\min(p_1,\, 1-p_1), \quad \text{where } p_1 = \frac{1}{N}\sum_{k=1}^{N} \mathbf{1}\{A_k \geq T_{n,stud}\}.
    \]

    The corresponding one-sided p-value is given by \(p_1\).
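Steps (1)–(3) can be combined into a single routine (a sketch with our own names and toy data; it uses the within-pair permutation scheme (II) for the resampling step, and estimates the two-sided p-value as 2 min(p₁, 1−p₁), one common choice):

```python
import numpy as np

rng = np.random.default_rng(42)

def resampling_p_value(x1, x2, n_resample=10_000, rng=rng):
    """Steps (1)-(3): permutation p-value, permuting within pairs (scheme II)."""
    d = np.asarray(x1) - np.asarray(x2)
    n = d.size
    t_obs = np.sqrt(n) * d.mean() / d.std(ddof=1)        # step (1): T_{n,stud}
    a = np.empty(n_resample)
    for k in range(n_resample):                          # step (2): A_1,...,A_N
        d_star = rng.choice([-1.0, 1.0], size=n) * d     # random within-pair swap
        a[k] = np.sqrt(n) * d_star.mean() / d_star.std(ddof=1)
    p1 = np.mean(a >= t_obs)                             # one-sided p-value
    return 2 * min(p1, 1 - p1)                           # step (3): two-sided

# hypothetical paired measurements, n = 10
x1 = [5.1, 4.8, 6.0, 5.5, 4.9, 5.7, 5.2, 5.4, 5.0, 5.3]
x2 = [4.9, 4.7, 5.8, 5.6, 4.8, 5.5, 5.0, 5.1, 4.9, 5.1]
p = resampling_p_value(x1, x2, n_resample=2000)
```

Any of the other schemes can be substituted in step (2) by replacing the line that generates `d_star`.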

3 Simulations

For testing the two-sided null hypothesis \(H_0:\mu_1=\mu_2\) formulated above, we consider the unconditional t-test \(\varphi_t\) based on the \(T(n-1)\)-approximation of the statistic \(T_{n,stud}\) in (1.2) and the various conditional resampling tests \(\varphi_{n}^{\ast}\) based on the resampling schemes (I) through (VI) as described in Sect. 2. The simulation studies investigate their behaviour with regard to maintaining the pre-assigned type-I error level under the hypothesis, and the power of the tests under alternatives. The observations \(\mathbf{X}_i=(X_{i,1},X_{i,2})'\), \(i=1,\ldots,n\), were generated using marginal distributions \(F_s\) and varying correlations \(\rho\in(-1,1)\). We generate exchangeable matched pairs having a bivariate normal, exponential, log-normal or uniform distribution, each with correlation \(\rho\in(-1,1)\), as well as non-exchangeable data by simulating

  1. (a)

    \(F_1=N(0,1)\) and \(F_2=N(0,2)\),

  2. (b)

    \(F_1=N(0,1)\) and \(F_2=N(0,4)\),

  3. (c)

    \(F_1=N(3,4)\) and \(F_2=\chi_{3}^{2}\), and

  4. (d)

    \(F_1=N(\exp(0.5),3)\) and \(F_2=LN(0,1)\),

each with correlation \(\rho\). Routine calculations show that \(\mu_1=\mu_2\) is fulfilled in all of these settings. We only consider the small sample sizes \(n=7\) and \(n=10\) throughout this paper. All simulations were conducted in the R computing environment, version 2.13.2 (www.r-project.org), each with \(nsim=10{,}000\) simulation runs and \(N=10{,}000\) bootstrap runs. The simulation results for exchangeable normally, exponentially, log-normally, and uniformly distributed matched pairs with the very small sample size of \(n=7\) and different correlations \(\rho\) are displayed in Table 1.
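Exchangeable matched pairs of this kind can be generated, for instance, from a bivariate normal distribution with correlation \(\rho\), applying the same monotone transform to both coordinates (a sketch with our own helper name; note that for non-normal margins the transform changes the correlation, so \(\rho\) then refers to the underlying normal scale):

```python
import numpy as np

rng = np.random.default_rng(0)

def bivariate_pairs(n, rho, transform=None, rng=rng):
    """n matched pairs from a bivariate normal with correlation rho;
    an optional transform (e.g. np.exp for log-normal margins) is applied
    to both coordinates, so the pairs stay exchangeable."""
    cov = np.array([[1.0, rho], [rho, 1.0]])
    x = rng.multivariate_normal(mean=[0.0, 0.0], cov=cov, size=n)
    return x if transform is None else transform(x)

x = bivariate_pairs(2000, rho=0.5, transform=np.exp)   # log-normal margins
emp_rho = np.corrcoef(x[:, 0], x[:, 1])[0, 1]          # attained correlation
```

For the log-normal case the attained product-moment correlation is smaller than the underlying \(\rho=0.5\), which is one reason correlations are varied over a grid in the simulations.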

Table 1 Type-I error level (α=5 %) simulations for very small sample sizes (n=7) with exchangeable distributions

It follows from Table 1 that the paired t-test is an accurate procedure for symmetric distributions (normal and uniform), even for the very small sample size of \(n=7\). When the data are skewed (exponential and log-normal), the t-test tends to be conservative. It is apparent that both wild bootstrap methods, using the Rademacher weights as defined in Remark 2.1(a) and the Gaussian weights given in Remark 2.1(b), are inappropriate tests for such small sample sizes. The resampling test with Rademacher weights is very liberal. This can be explained by the fact that these weights are highly skewed. Roughly speaking, both wild bootstrap resampling distributions are too far away from the distribution of \(T_{n,stud}\) when \(n\) is rather small and the original data are not resampled. Simply drawing the differences from the data with replacement cannot be recommended either. The corresponding test tends to be quite liberal when the data are skewed. This occurs because the resampling variability (i.e. the variability within the resampling variables \(D_{i}^{\ast}\)) is rather small when \(n=7\). However, drawing with replacement either from all \(2n\) observations or from each marginal sample separately results in more accurate test decisions. Comparing these results with the permutation-based approaches, it is easily seen that both kinds of permutation tests (i.e. permuting all data, or permuting within the sample unit) control the type-I error level for all distributions and all dependencies \(\rho\) in the data. Next we investigate the behaviour of the different resampling tests for the larger sample size \(n=10\). The simulation results are displayed in Table 2.

Table 2 Type-I error level (α=5 %) simulations for moderate sample sizes (n=10) with exchangeable distributions

From Table 2 an interesting phenomenon of the procedures based on the resampling schemes (IV) and (V) can be observed: the rejection rates do not converge monotonically in \(n\) to \(\alpha\). The tests are more liberal with \(n=10\) than with \(n=7\). Their liberality increases with increasing \(n\) up to a breakpoint of about \(n\approx 15\). For larger \(n\) (e.g. \(n\geq 30\)), all resampling tests based on drawing with replacement are accurate. The liberality of the wild bootstrap tests using Rademacher or Gaussian weights decreases. Both kinds of permutation approaches, however, are still the most accurate procedures.

Now we investigate how accurately the tests control the type-I error level when the two marginal distributions differ. The simulation results for the non-exchangeable distributions (a) through (d) with \(n=7\) and varying correlations are displayed in Table 3.

Table 3 Type-I error level (α=5 %) simulations for very small sample sizes (n=7) with non-exchangeable distributions (a) through (d) as described in the text

It follows from Table 3 that both permutation approaches are accurate, even for non-exchangeable distributions, \(n=7\), and permutations of all data \(\mathbf{X}\). When two distributions with extremely different shapes are compared at negative correlations (normal versus log-normal), they tend to be slightly liberal. The same conclusion, however, applies to the t-test. In Table 4 the simulation results for \(n=10\) and the same non-exchangeable distributions are given.

Table 4 Type-I error level (α=5 %) simulations for moderate sample sizes (n=10) with non-exchangeable distributions (a) through (d) as described in the text

For the larger sample size \(n=10\), both permutation approaches are accurate and demonstrate a behaviour similar to that of the t-test.

To compare the power of the tests, we generate bivariate normally and log-normally distributed matched pairs with \(n=10\) and \(n=20\), respectively, each with correlation \(\rho=1/2\). Here, the data at time point 2 were shifted by \(\delta\in(0,1)\). The simulation results for \(n=10\) are displayed in Table 5. Although the wild bootstrap methods using Rademacher and Gaussian weights, as well as the resampling tests based on schemes (III)–(V), were quite liberal in these situations, we included them in the power simulation study. To give a fair comparison between the procedures, however, we will not grade them in detail and concentrate on the t-test and the permutation-based approaches.

Table 5 Power (α=5 %) simulations for moderate sample sizes (n=10) and ρ=1/2

It follows from Table 5 that both permutation approaches have a power comparable to the t-test under normality. Under non-normality, the power of the permutation-based approaches is remarkably higher. The same conclusions can be drawn for \(n=20\), as can be seen from Table 6.

Table 6 Power (α=5 %) simulations for moderate sample sizes (n=20) and ρ=1/2

4 Discussion

We analyzed two different permutation approaches for testing \(H_0:\mu_1=\mu_2\) with paired data under non-normality. In particular, we demonstrated that the usual assumption of exchangeability is not necessary for the construction of permutation tests. We have analytically shown that permutation approaches which are based on permutations of all observed data (i.e. which neglect the dependency structure) are asymptotically valid procedures. The results are obtained by investigating the conditional permutation distribution of studentized statistics; none of the results in this paper would hold without the studentization. The investigation of permutation techniques in heteroscedastic repeated measures designs will be part of future research.

In this paper, only mean based approaches were considered. Rank-based studentized permutation tests are proposed by Konietschke and Pauly (2012).