1 Introduction

The problem of testing a population null hypothesis was of interest to Neyman and Pearson but not to Fisher. The difference in approach to the philosophy of science between Neyman–Pearson and Fisher is quite discernible in Fisher’s own polemical article, wherein he described it as a difference in logical points of view (see Fisher 1955, p. 69). He further stated

\(\ldots \) we consider a continuum of hypotheses each eligible as null hypothesis, and it is the aggregate of frequencies calculated from each possibility in turns as true—including frequency of error, therefore only the “first kind”, without any assumption of knowledge a priori–which supply [ ] the amounts of information available.

Basically, Fisher’s argument concerned two issues with the Neyman–Pearson theory: (i) the assumption of repeated sampling from the same population, and (ii) the definition of errors of the second kind (type II errors). For him, all one can do from a single experiment is to make inferences to the sample. Extrapolating the result by assuming repeated random sampling from a well-defined population with the aim of testing a population null made no sense to him.

Under the sharp null of no effect for any unit, the Fisher randomization test is a valid procedure for inference to the sample. The test has the correct level for an effect in the sample without needing any further assumption (Rubin 1980, 1986; Rosenbaum 2007). Under the additional stable unit treatment value assumption (Rubin 1980), the test has the correct level for, e.g., testing for an average treatment effect in the sample. Thus, with the aim of testing an average treatment effect in the sample, we could use either the Fisher randomization test or Neyman’s test (a t-test) (cf. Ding 2017).

The motivation for this article stems from interesting results in Ding (2017), who compared the power of the randomization test with that of the t-test for design-based inference to the sample. Using finite population asymptotic theory, Ding (2017) showed that Fisher’s test statistic, based on the mean-difference estimator (MDE), is approximately normal under the sharp null. He concluded that, if this normal approximation is used under the alternative, the Fisher test is less powerful than the t-test even in the simplest case of a homogeneous effect. Moreover, the relative power of the t-test against Fisher’s test was shown to increase with the size of the treatment effect.

The growing interest in Fisher’s test based on the MDE in social science (Athey and Imbens 2017; Young 2018) for inference to the superpopulation makes these results important and relevant to examine. One reason is that the results in Ding (2017) are established only under the null, so any differences that might otherwise be revealed under the alternative remain unexplored. A second reason is that the results in Ding (2017) are in contrast to theoretical results based on repeated sampling, or superpopulation, asymptotics, and to empirical results on Fisher inference to the superpopulation (Young 2018). The increasing popularity of computer-based experimental designs is yet another reason: for some of these algorithms Neyman–Pearson inference is not an option, but the Fisher randomization test is. Thus, if this test is less efficient than the Neyman test, the efficiency gains from some of the designs may well be lost when conducting the inference.

In line with Fisher, we define the population null as the scientific null and the sample null as the statistical null. We discuss superpopulation asymptotics and finite population asymptotics for both the Neyman test and the Fisher randomization test based on the MDE for the two estimands: the population average treatment effect (PATE) and the sample average treatment effect (SATE). Following the same data generating processes as in Ding (2017), we then conduct a simulation study to assess the power properties of Fisher’s test based on the MDE for inference to both the PATE and the SATE. To our knowledge, this simulation study is the first of its kind, presumably due to the computational complexity of calculating the power of Fisher randomization tests.

Our simulation results show no overall superiority of Neyman’s test over Fisher’s test for any effect size. Instead, the property of a test being most powerful in this case seems to depend on the characteristics of the outcomes in the given sample. For most samples, and over most subsets of allocations, however, the tests show similar performance.

In addition, we find that with heterogeneous treatment effects, both tests in general have the wrong size when testing the population null in a single experiment. The results illustrate Fisher’s concern about the “continuum of hypotheses each eligible as null hypothesis”. The general attitude in the research community is that the t-test is preferable to the exact test, as it is not restricted to the unrealistic assumption of homogeneous treatment effects. However, with heterogeneous effects the statistical null differs from the scientific null. The implication is that the size of a test against a scientific null is in general only correct under repeated sampling from the same population, and this holds for both tests.

The results also show that finite population asymptotics is useful, as it allows for Neyman–Pearson design-based inference for a fixed, and quite small, sample size. However, they also show that a comparison of repeated sampling inference with design-based asymptotics is not meaningful, as asymptotically the MDE has the same variance and the SATE equals the PATE.

It may also be added that our results are in line with those in Young (2018), who studied differences in performance for inference to the population using Monte Carlo simulation and a reanalysis of 53 experimental papers culled from journals of the American Economic Association. He found similar performance when there are no influential observations, but that the exact Fisher test based on the t-statistic (Chung and Romano 2013) has better size properties than the Neyman test. Note further that our focus is on the sample, or experiment, which clarifies the case for both tests with inference to the sample under heterogeneous effects. Since the focus of Young (2018) is on differences in inference to the population, the two studies are complementary to each other.

We may also refer to a few recent variants and extensions of the theory in different directions. Ding and Dasgupta (2018) and Wu and Ding (2021) discuss alternative strategies to Fisher’s randomization test for testing equality of average treatment effects. By re-formulating the testing problem as a general linear hypothesis (GLH), using an appropriate GLH matrix consisting of contrast comparisons, they consider Wald, ANOVA-type, and least-squares based test statistics, with focus on certain special cases. The same testing problem is also considered in Zhao and Ding (2021), additionally by adjusting the responses for the presence of covariates.

We begin, in the next section, with a brief orientation to the Neyman and Fisher lines of inference using the potential outcome framework of Neyman (1923). The simulation scheme, comparing the power of the Neyman test and the exact Fisher randomization test (FRT) for large n, is discussed in Sec. 3, and its results are presented and discussed in Sec. 4. The paper concludes with some general discussion in Sec. 5.

2 Neyman and Fisher inference

Let \(Y_{i}(w) \in R\) denote the potential outcome for unit i with binary indicator w referring to the treatment group, i.e. \(w = 1\) implies treatment and \(w = 0\) implies control. In a completely randomized experiment with n units, \(n_{1}\) units are assigned to the treatment group and \(n_{0}\) units to the control. As a unit can only be assigned to one of the two groups, letting \(W_{i}=1\) or 0 if unit i is assigned to treatment or control group, respectively, the observed outcome can be either \(Y_{i}(W_{i}=1)\) or \(Y_{i}(W_{i}=0)\). In a more compact form, we can thus write an observed outcome for unit i as

$$\begin{aligned} Y_i = W_iY_i(1) + (1 - W_i)Y_i(0). \end{aligned}$$
(1)

Our inference for \(Y_i\), or any linear combination thereof, will be based on the so-called SUTVA assumption (Rubin 1980), which implies that there is no interference between units and that there are no different versions of the treatment. The quantity of our main interest, the mean difference estimator (MDE), is defined, using (1), as

$$\begin{aligned} {\widehat{\tau }} = {\overline{Y}}_{1}-{\overline{Y}}_{0}, \end{aligned}$$
(2)

where

$$\begin{aligned} {\overline{Y}}_{w}=\frac{1}{n_{w}}\sum _{i:W_{i}=w}Y_{i},~~w=0,1. \end{aligned}$$

For the inference of \({\widehat{\tau }}\), we need additional assumptions to be stated later.
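As a concrete illustration, Eqs. (1) and (2) translate directly into code. The sketch below uses hypothetical simulated potential outcomes; the sample size, means, and seed are arbitrary choices, not values from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)

n, n1 = 10, 5
Y1 = rng.normal(0.5, 1.0, n)  # potential outcomes Y_i(1)
Y0 = rng.normal(0.0, 1.0, n)  # potential outcomes Y_i(0)

# completely randomized assignment: n1 treated, n - n1 controls
W = np.zeros(n, dtype=int)
W[rng.choice(n, size=n1, replace=False)] = 1

# observed outcomes, Eq. (1): Y_i = W_i Y_i(1) + (1 - W_i) Y_i(0)
Y = W * Y1 + (1 - W) * Y0

# mean-difference estimator (MDE), Eq. (2)
tau_hat = Y[W == 1].mean() - Y[W == 0].mean()
```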

Throughout the paper we consider a sample of size n to be a random sample from a (potential) finite population of N units, \(n\le N\), and special cases thereof. For clarity, we will index \({\widehat{\tau }}\) to indicate over which distribution we randomize, i.e., what sampling design is being considered. For example, we let \({\widehat{\tau }}_{N,n}\) denote the estimator over random sampling from the population and over random treatment assignment within the random sample. When \(n=N\), we for simplicity denote \({\widehat{\tau }}_{N,n} = {\widehat{\tau }}_{n}\), and when sampling from the super population we denote the estimator \({\widehat{\tau }}_{\infty , n}\).

2.1 Neyman inference

Given population and sample sizes, N and n, respectively, a total of \(S = \genfrac(){0.0pt}1{N}{n}\) random samples can be drawn in this set up. Let \(\textbf{u}_{s}^{n}\) be a vector containing the indices of the units in the sth sample, for \(s = 1,...,S\). The sample average treatment effect (SATE) for the sth sample then follows as

$$\begin{aligned} {{\,\textrm{SATE}\,}}(\textbf{u}_{s}^{n}) = \frac{1}{n}\sum _{i\in \textbf{u}_{s}^{n}}(Y_{i}(1)-Y_{i}(0)), \end{aligned}$$
(3)

and the population average treatment effect (PATE) is

$$\begin{aligned} \tau = \mu _{1}-\mu _{0}, \end{aligned}$$

with \(\mu _{w}=\frac{1}{N}\sum _{i=1}^{N}Y_{i}(w)\), \(w = 0, 1\), denoting the population mean. Note that with homogeneous treatment effects \({{\,\textrm{SATE}\,}}(\textbf{u}_{s}^{n})=\tau , \forall s=1,...,S\).
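The relation between the SATE in (3) and the PATE can be illustrated for a small hypothetical population by enumerating all \(\genfrac(){0.0pt}1{N}{n}\) samples (a sketch with arbitrary numbers, not data from the paper):

```python
import numpy as np
from itertools import combinations

rng = np.random.default_rng(1)
N, n = 8, 4
Y1 = rng.normal(0.3, 1.0, N)  # potential outcomes under treatment
Y0 = rng.normal(0.0, 1.0, N)  # potential outcomes under control

tau = Y1.mean() - Y0.mean()   # PATE

# SATE for every possible sample u_s of size n, Eq. (3)
sates = [(Y1[list(u)] - Y0[list(u)]).mean()
         for u in combinations(range(N), n)]

# each unit appears in equally many samples, so the SATEs
# average back to the PATE over all C(N, n) samples
mean_sate = sum(sates) / len(sates)
```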

Now, it can be shown that (see Aronow et al. 2014)

$$\begin{aligned} V({\widehat{\tau }}_{N,n})=\frac{1}{N-1}\left\{ \frac{N-n_{1}}{n_{1}}\sigma _{Y(1)}^{2}+\frac{N-n_{0}}{n_{0}}\sigma _{Y(0)}^{2}+2\sigma _{Y(1),Y(0)}\right\} , \end{aligned}$$
(4)

where

$$\begin{aligned} \sigma _{Y(w)}^{2} = \frac{1}{N}\sum _{i=1}^{N}(Y_{i}(w)-\mu _{w})^{2},~w=0,1,~~\text {and}~~\sigma _{Y(1),Y(0)} = \frac{1}{N}\sum _{i=1}^{N}(Y_{i}(1)-\mu _{1})(Y_{i}(0)-\mu _{0}). \end{aligned}$$

Likewise, the variance of the heterogeneous treatment effect follows as

$$\begin{aligned} \sigma _{\tau }^{2}= & {} \frac{1}{N}\sum _{i=1}^{N}(Y_{i}(1)-Y_{i}(0)-(\mu _{1}-\mu _{0}))^{2} \\= & {} \sigma _{Y(1)}^{2}+\sigma _{Y(0)}^{2}-2\sigma _{Y(1),Y(0)}, \end{aligned}$$

so that we can re-write (4) as

$$\begin{aligned} V({\widehat{\tau }}_{N,n})= & {} \frac{1}{N-1}\left\{ \frac{N-n_{1}}{n_{1}}\sigma _{Y(1)}^{2}+\frac{N-n_{0}}{n_{0}}\sigma _{Y(0)}^{2}+\sigma _{Y(1)}^{2}+\sigma _{Y(0)}^{2}-\sigma _{\tau }^{2}\right\} \nonumber \\= & {} \frac{N}{N-1}\left\{ \frac{1}{n_{1}}\sigma _{Y(1)}^{2}+\frac{1}{n_{0}}\sigma _{Y(0)}^{2}\right\} -\frac{1}{N-1}\sigma _{\tau }^{2}. \end{aligned}$$
(5)

As \(N\rightarrow \infty \) with \(\sigma _{\tau }^{2}\) fixed, the last term in (5) vanishes and the multiplying factor \(N/(N - 1) \rightarrow 1\), reducing \(V({\widehat{\tau }}_{N,n})\) to the usual variance of two independent samples, i.e.,

$$\begin{aligned} V({\widehat{\tau }}_{N,n}) = \left\{ \frac{1}{n_{1}}\sigma ^2_{Y(1)} + \frac{1}{n_{0}}\sigma ^2_{Y(0)}\right\} [1 + o(1)]. \end{aligned}$$
(6)

Then the standard central limit theorem (CLT) applies under the random sampling mechanism. It is easy to show (see Theorem 5.1 in the Appendix) that, as \(n, N \rightarrow \infty \),

$$\begin{aligned} \frac{{\widehat{\tau }}_{N,n} - \tau }{\sqrt{{{\,\textrm{Var}\,}}({\widehat{\tau }}_{N,n})}} \xrightarrow {{\mathcal {D}}} N(0, 1). \end{aligned}$$
(7)

The proof follows from the discussion above, or as a special case of Example 6 in Li and Ding (2017).
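Since (5) is an algebraic rearrangement of (4), the two expressions can be checked against each other numerically. The sketch below uses hypothetical potential outcomes with the \(1/N\)-normalized population moments defined above:

```python
import numpy as np

rng = np.random.default_rng(2)
N, n1, n0 = 20, 12, 8
Y1 = rng.normal(0.2, 1.0, N)  # potential outcomes under treatment
Y0 = rng.normal(0.0, 1.0, N)  # potential outcomes under control

mu1, mu0 = Y1.mean(), Y0.mean()
sig1 = np.mean((Y1 - mu1) ** 2)           # sigma^2_{Y(1)}, 1/N normalization
sig0 = np.mean((Y0 - mu0) ** 2)           # sigma^2_{Y(0)}
sig10 = np.mean((Y1 - mu1) * (Y0 - mu0))  # sigma_{Y(1),Y(0)}
sig_tau = sig1 + sig0 - 2 * sig10         # sigma^2_tau

# Eq. (4)
v4 = ((N - n1) / n1 * sig1 + (N - n0) / n0 * sig0 + 2 * sig10) / (N - 1)
# Eq. (5), second line
v5 = N / (N - 1) * (sig1 / n1 + sig0 / n0) - sig_tau / (N - 1)
```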

It is evident that under the super population assumption

$$\begin{aligned} \frac{{\widehat{\tau }}_{\infty ,n} - \tau }{\sqrt{{{\,\textrm{Var}\,}}({\widehat{\tau }}_{\infty ,n})}} \xrightarrow {{\mathcal {D}}} N(0, 1), \end{aligned}$$

but with \(\sigma ^2_{Y(w)} = {{\,\textrm{E}\,}}(Y_{i}(w)-\mu _{w})^{2}\) and \(\mu _{w} = {{\,\textrm{E}\,}}(Y_{i}(w))\), \(w = 0,1\), in (6).

With regard to Neyman’s within-sample inference, let

$$\begin{aligned} {\overline{Y}}(w)=\frac{1}{n}\sum _{i=1}^{n}Y_{i}(w)\text { and }S_{Y(w)}^{2}= \frac{1}{n-1}\sum _{i=1}^{n}(Y_{i}(w)-{\overline{Y}}(w))^{2},w=0,1. \end{aligned}$$

This helps us define

$$\begin{aligned} V({\widehat{\tau }}_{n})=\frac{S_{Y(1)}^{2}}{n_{1}}+\frac{S_{Y(0)}^{2}}{n_{0}}-\frac{S_{\tau }^{2}}{n}, \end{aligned}$$
(8)

as originally given by Neyman (1923), where \(S^2_{\tau }\) is the sample variance of the heterogeneous treatment effect, defined as

$$\begin{aligned} S^2_{\tau } = \frac{1}{n-1}\sum _{i=1}^{n}(Y_{i}(1)-Y_{i}(0)-({\overline{Y}}(1)-{\overline{Y}}(0)))^{2}. \end{aligned}$$
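When \(N=n\), Eq. (8) is the exact randomization variance of the MDE over complete randomization, which can be confirmed by enumerating all allocations for a small hypothetical sample (a sketch, not the paper's simulation code):

```python
import numpy as np
from itertools import combinations

rng = np.random.default_rng(3)
n, n1 = 8, 4
n0 = n - n1
Y1 = rng.normal(0.5, 1.0, n)  # potential outcomes under treatment
Y0 = rng.normal(0.0, 1.0, n)  # potential outcomes under control

# sample variances with 1/(n-1) normalization
S1, S0 = Y1.var(ddof=1), Y0.var(ddof=1)
S_tau = (Y1 - Y0).var(ddof=1)

# Neyman's variance, Eq. (8)
v_neyman = S1 / n1 + S0 / n0 - S_tau / n

# exact variance of the MDE over all C(n, n1) treatment allocations
taus = []
for treated in combinations(range(n), n1):
    mask = np.zeros(n, dtype=bool)
    mask[list(treated)] = True
    taus.append(Y1[mask].mean() - Y0[~mask].mean())
v_exact = np.var(taus)  # 1/A over the uniform allocation distribution
```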

This structure can be used to test Neyman’s SATE hypotheses

$$\begin{aligned} H^{n}_0(N):~{{\,\textrm{SATE}\,}}(\textbf{u}_{s}^{n}) = 0 ~~\text {vs.}~~H^{n}_1(N):~{{\,\textrm{SATE}\,}}(\textbf{u}_{s}^{n})\ne 0. \end{aligned}$$
(9)

Recall also that the corresponding PATE hypotheses are

$$\begin{aligned} H_0(N):\tau =0 ~~\text {vs.}~~ H_1(N):\tau \ne 0. \end{aligned}$$
(10)

Using \({\widehat{\tau }}_n\) as test statistic, it follows (see Theorem 5.2 in the Appendix) that, for the SATE hypotheses,

$$\begin{aligned} \frac{{\widehat{\tau }}_n - \tau }{\sqrt{{{\,\textrm{Var}\,}}({\widehat{\tau }}_n)}} \xrightarrow {{\mathcal {D}}} N(0, 1), \end{aligned}$$
(11)

where \({{\,\textrm{Var}\,}}({\widehat{\tau }}_n)\) is given in (8) and \({\widehat{\tau }}_{N,n} = {\widehat{\tau }}_n\) for \(N = n\), so that \({{\,\textrm{SATE}\,}}(\textbf{u}_{s}^{n}) \rightarrow \tau \) when \(n \rightarrow \infty \).

Neyman (1923) proposed

$$\begin{aligned} \widehat{{{\,\textrm{Var}\,}}({\widehat{\tau }}_{n})} = \frac{s_{Y_{1}}^{2}}{n_{1}}+\frac{s_{Y_{0}}^{2}}{n_{0}}, \end{aligned}$$
(12)

as a consistent estimator of the variance, simultaneously for the inference to the sample (cf. Eqn. 11) and to the population as \(N \rightarrow \infty \) (cf. Eqn. 7), where

$$\begin{aligned} s^2_{Y_w} = \frac{1}{n_w - 1}\sum _{i:W_i=w}(Y_i - {\overline{Y}}_w)^2,~w=0,1. \end{aligned}$$

Note that (12) is obtained by ignoring \(S_{\tau }^{2}/n\) in \(V({\widehat{\tau }}_{n})\), making \(\widehat{{{\,\textrm{Var}\,}}({\widehat{\tau }}_{n})}\) an upper-bound estimator of \({{\,\textrm{Var}\,}}({\widehat{\tau }}_{n})\). Under rather weak assumptions (see assumptions (5.1) and (5.2) in the Appendix) we can replace \({{\,\textrm{Var}\,}}({\widehat{\tau }}_n)\) in equation (11) by the Neyman variance estimator \(\widehat{{{\,\textrm{Var}\,}}({\widehat{\tau }}_{n})}\).

Thus, for a realized experiment where the jth allocation is randomly selected, \(j=1,...,\genfrac(){0.0pt}1{n}{n_{1}}\), the asymptotic p-value of a two-sided test can be approximated by

$$\begin{aligned} \pi _{N}=2\Phi \left( -|{\widehat{\tau }}_{n}^{j}|/\sqrt{\widehat{{{\,\textrm{Var}\,}}({\widehat{\tau }}_{n}^j)}}\right) , \end{aligned}$$
(13)

where \({\widehat{\tau }}_{n}^{j}\) is the estimate from the jth allocation based on the sample \(\textbf{u}_{s}^{n}\) and \(\Phi (\cdot )\) is the distribution function of the standard normal distribution.

To summarize, the same test statistic is used for inference to the PATE and the SATE. In order to establish (11), we implicitly assume that the experiment is conducted on the whole population.
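A minimal sketch of the Neyman test in (12)–(13) on hypothetical data (group sizes and distributions are arbitrary; only the standard library and NumPy are used):

```python
import numpy as np
from math import erf, sqrt

def phi(z):
    """Standard normal CDF."""
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))

rng = np.random.default_rng(4)
n1 = n0 = 50
Yt = rng.normal(0.1, 0.25, n1)  # observed treated outcomes
Yc = rng.normal(0.0, 0.25, n0)  # observed control outcomes

tau_hat = Yt.mean() - Yc.mean()
# Neyman's conservative variance estimator, Eq. (12)
var_hat = Yt.var(ddof=1) / n1 + Yc.var(ddof=1) / n0
# two-sided asymptotic p-value, Eq. (13)
p_neyman = 2.0 * phi(-abs(tau_hat) / sqrt(var_hat))
```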

2.2 Fisher’s exact randomization test

Consider Fisher’s null and alternative hypotheses, respectively, as

$$\begin{aligned} H_{0}^{n}(F): Y_{i}(1) = Y_{i}(0)~\forall ~i\in \textbf{u}_{s}^{n}~~\text {vs.}~~H_{1}^{n}(F): Y_{i}(1)\ne Y_{i}(0)~\text {for some }i \in \textbf{u}_{s}^{n}, \end{aligned}$$

where \(H_{0}^{n}(F)\) coincides with Neyman’s null, \(H^{n}_0(N)\), in (9) under a homogeneous treatment effect within the sample, i.e. we can write

$$\begin{aligned} H_{0}^{n}(F):{{\,\textrm{SATE}\,}}(\textbf{u}_{s}^{n})=0~~\text {vs.}~~{{\,\textrm{SATE}\,}}(\textbf{u}_{s}^{n})\ne 0. \end{aligned}$$

The exact Fisher randomization test (FRT) is performed by estimating the treatment effect under all possible permutations of the ‘potential’ outcomes under \(H_{0}^{n}(F)\). To see this, let \(A = \genfrac(){0.0pt}1{n}{n_{1}}\) and let the matrix \(\textbf{W} = (w_{ij}) \in R^{n\times A}\) arrange all possible random allocations in a completely randomized experiment such that \(w_{ij} = 0\) if unit i is not treated and \(w_{ij} = 1\) if treated. Denoting \(\textbf{Y}(w) = (Y_1(w),...,Y_n(w))^{\prime }\), \(w = 0,1\), the vector of observed outcomes is defined as

$$\begin{aligned} \textbf{Y}=\textbf{W}^{j}\textbf{Y}(1)+(\textbf{1}-\textbf{W}^{j})\textbf{Y}(0), \end{aligned}$$

where \(\textbf{W}^{j}=(W_{1}^{j},...,W_{n}^{j})^{\prime }\) is the specific allocation vector for the jth experiment, \(j = 1,...,A\), and \(\textbf{1}\) is a vector of 1’s. The exact p-value for a two-sided hypothesis can then be obtained as

$$\begin{aligned} \pi _{F}=\Pr \left\{ |{\widehat{\tau }}(\textbf{W},\textbf{Y})|\ge |{\widehat{\tau }}_{n}^{j}|~\big |~H_{0}^{n}(F)\right\} , \end{aligned}$$
(14)

where \({\widehat{\tau }}(\textbf{W},\textbf{Y})\) denotes the (symmetric) randomization distribution of the estimates under the null over all A allocations in \(\textbf{W}\). As this test is derived solely from the actual randomization, the size of the test is always correct.
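For small n, the exact FRT p-value in (14) can be computed by full enumeration. A sketch under the sharp null, on hypothetical data; `mde` is a helper introduced here, not notation from the paper:

```python
import numpy as np
from itertools import combinations

rng = np.random.default_rng(5)
n, n1 = 10, 5
Y = rng.normal(0.0, 1.0, n)     # observed outcomes (sharp null holds)
treated_obs = tuple(range(n1))  # realized allocation: first n1 units treated

def mde(Y, treated):
    mask = np.zeros(len(Y), dtype=bool)
    mask[list(treated)] = True
    return Y[mask].mean() - Y[~mask].mean()

t_obs = abs(mde(Y, treated_obs))

# under the sharp null, Y is fixed across allocations, so the randomization
# distribution of the MDE runs over all A = C(n, n1) allocations
null_stats = [abs(mde(Y, tr)) for tr in combinations(range(n), n1)]

# exact two-sided p-value, Eq. (14)
pi_F = sum(s >= t_obs for s in null_stats) / len(null_stats)
```

Since the realized allocation is itself one of the A allocations, the p-value is bounded below by 1/A.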

To consider the asymptotic case, a direct application of a central limit theorem in Ding (2017) ensures asymptotic normality of \({\widehat{\tau }}(\textbf{W},\textbf{Y}\mathbf {)}\) under \(H_{0}^{n}(F)\), with limiting variance

$$\begin{aligned} {{\,\textrm{Var}\,}}({\widehat{\tau }}(\textbf{W},\textbf{Y})\,|\,H_{0}^{n}(F))=\frac{n}{n_{1}n_{0}}s^{2}, \end{aligned}$$

where \(s^{2}=(n-1)^{-1}\sum _{i=1}^{n}(Y_{i}-{\overline{Y}})^{2}\) and \({\overline{Y}}=n^{-1}\sum _{i=1}^{n}Y_{i}\). Comparing this variance of the normal approximation of Fisher’s exact test with that of Neyman’s conservative test, the discrepancy comes out to be (see Ding 2017)

$$\begin{aligned} {{\,\textrm{Var}\,}}({\widehat{\tau }}(\textbf{W},\textbf{Y})\,|\,H_{0}^{n}(F))-Var(N) = \left( \frac{1}{n_{0}}-\frac{1}{n_{1}}\right) (S_{1}^{2}-S_{0}^{2})+\frac{1}{n}({\overline{Y}}(1)-{\overline{Y}}(0))^{2}+o_{p}(n^{-1}), \end{aligned}$$
(15)

using \(Var(N) = S^2_{Y(1)}/n_{1} + S^2_{Y(0)}/n_{0}\). This implies that Fisher’s and Neyman’s tests are asymptotically equivalent under the null if either \(n_0 = n_1\) or \(S^2_1 = S^2_0\). Otherwise, the relative difference of the variances grows with the size of the treatment effect.
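The leading terms in (15) can be checked numerically by comparing the pooled-variance (Fisher) and group-variance (Neyman) expressions on a hypothetical unbalanced sample; here the variances in (15) are approximated by the observed-group sample variances:

```python
import numpy as np

rng = np.random.default_rng(6)
n1, n0 = 70, 30
n = n1 + n0
Yt = rng.normal(0.3, 0.5, n1)    # treated outcomes
Yc = rng.normal(0.0, 0.25, n0)   # control outcomes
Y = np.concatenate([Yt, Yc])

s2 = Y.var(ddof=1)               # pooled variance used under the sharp null
var_fisher = n / (n1 * n0) * s2  # limiting FRT variance
var_neyman = Yt.var(ddof=1) / n1 + Yc.var(ddof=1) / n0

# leading terms of the discrepancy, Eq. (15)
approx = ((1 / n0 - 1 / n1) * (Yt.var(ddof=1) - Yc.var(ddof=1))
          + (Yt.mean() - Yc.mean()) ** 2 / n)
remainder = (var_fisher - var_neyman) - approx  # should be o(1/n)
```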

2.2.1 Permutation testing

Let \(F_{Y(w)}\), \(w=0, 1\), denote the distribution of the potential outcomes in the population. For a random sample of observed outcomes \(\left\{ Y_i(W_i)\right\} _{i=1}^{n}\) that are exchangeable (e.g. under \({{\,\textrm{SATE}\,}}(\textbf{u}_{s}^{n})=\tau \) with homoscedasticity), Hoeffding (1951)’s well-known permutation CLT gives (see also Li and Ding 2017; Boos and Stefanski 2013), under the null,

$$\begin{aligned} {\widehat{\tau }}(\textbf{W},\textbf{Y}\mathbf {)}/\sqrt{{{\,\textrm{Var}\,}}({\widehat{\tau }}_{\infty ,n})} \xrightarrow {{\mathcal {D}}} N(0, 1), \end{aligned}$$

as \(n \rightarrow \infty \). Romano (1990) showed that if the distributions of \(\left\{ Y_{i}(1)\right\} _{i=1}^{n_{1}}\) and \(\left\{ Y_{i}(0)\right\} _{i=1}^{n_{0}}\) have common mean \(\mu \) and finite variances \(\sigma _{1}^{2}\) and \(\sigma _{0}^{2}\), and if \(n_{1}/n_{0}\rightarrow 1\), then

$$\begin{aligned} {\widehat{\tau }}(\textbf{W},\textbf{Y})\overset{d}{\rightarrow }{\widehat{\tau }}_{\infty ,n}/\sqrt{V({\widehat{\tau }}_{\infty ,n})} \text { as }n\rightarrow \infty . \end{aligned}$$

Note that if the exact FRT is carried out using the statistic

$$\begin{aligned} {\widehat{\tau }}_{\infty ,n}/\sqrt{\frac{s_{Y_{1}}^{2}}{n_{1}}+\frac{ s_{Y_{0}}^{2}}{n_{0}}}, \end{aligned}$$

then its type I error asymptotically coincides with the nominal \(\alpha \) under \(H_{0}(N)\), and it also retains the exact error rate \(\alpha \) in finite samples under the sharp null (Chung and Romano 2013). Furthermore, under normality of the super populations, i.e. \(Y(w)\sim N(\mu _{w},\sigma ^{2})\), \(w = 0, 1\), the exact FRT is the UMP test (Lehmann 1959, §5.8). Further, if \(n_{1}/n_{0}\) is bounded as \(n\rightarrow \infty \), then the exact FRT can be approximated by the standard t-test.
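A sketch of the FRT based on the studentized statistic above (cf. Chung and Romano 2013), using Monte Carlo re-allocations rather than full enumeration; the data, group sizes, and number of draws M are arbitrary choices:

```python
import numpy as np

rng = np.random.default_rng(7)
n1, n0 = 70, 30
n = n1 + n0
Yt = rng.normal(0.0, 0.5, n1)   # treated outcomes (null holds)
Yc = rng.normal(0.0, 0.25, n0)  # control outcomes, unequal variance
Y = np.concatenate([Yt, Yc])

def t_stat(Y, treat_mask):
    """Studentized mean difference with unpooled variances."""
    a, b = Y[treat_mask], Y[~treat_mask]
    se = np.sqrt(a.var(ddof=1) / len(a) + b.var(ddof=1) / len(b))
    return (a.mean() - b.mean()) / se

obs_mask = np.zeros(n, dtype=bool)
obs_mask[:n1] = True
t_obs = abs(t_stat(Y, obs_mask))

# FRT with the studentized statistic, approximated by M random re-allocations
M = 2000
count = 0
for _ in range(M):
    mask = np.zeros(n, dtype=bool)
    mask[rng.choice(n, n1, replace=False)] = True
    count += abs(t_stat(Y, mask)) >= t_obs
p_studentized = count / M
```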

3 Power of exact FRT for large n

Now consider power. For the exact FRT, the power under an alternative is a fixed quantity for a given set of \(\textbf{Y}(1)\) and \(\textbf{Y}(0)\) in the design space \(\textbf{W}\). For each allocation vector \(\textbf{W}^{j}\), \(j=1,...,A\), there is a corresponding \(\textbf{Y}\) and an exact p-value, \(\pi _{F}\), defined by Eqn. (14). The power of the exact FRT for inference to the units of the sample is defined as the proportion of the A allocations in \(\textbf{W}\) that achieve a \(\pi _{F}\) smaller than or equal to \(\alpha \). A Monte Carlo simulation of “exact power” would thus require \(A^{2}\) calculations.

For large n, \(\pi _{F}\) can instead be approximated through simulation, estimating \(\pi _{F}\) using \(M<A\) randomly drawn allocations. For sufficiently large M, this approximation, say \({\widehat{\pi }}_{F}\), will be close to \(\pi _{F}\). Achieving a similar accuracy for power estimated through this simulation-based approximation obviously requires \(M^{2}\) computations.

We dedicate this section to discussing an alternative approximation strategy, where we perform the exact FRT and power computations using independent subsets of allocations, and then average the results over the subsets. To motivate the case, we begin by briefly reviewing the small simulation study reported in Ding (2017).

3.1 Motivation

The following description is directly taken from Ding (2017). Let \(Y(0)\sim N(0,1/16)\) and \(n=100\). For Y(1), the data generating process (DGP) is bifurcated as: (a) \(Y(1)\sim N(\tau ,1/16)\), \(n_{1}=n_{0}=50\), and (b) \(Y(1)\sim N(\tau ,1/4)\), \(n_{1}=70\), \(n_{0}=30\), where \(\tau =1/10\) in both (a) and (b). Since \(Y_{i}(0)\ne Y_{i}(1)\) \(\forall i\in \textbf{u}_{s}^{100}\) for all generated samples, \(\tau =0\) will differ from \(H_{0}^{n}(F)\) or, more generally, from \(H_{0}^{n}(N)\) in both DGPs. To make this more precise, let \(Y_{i}(0)=\varepsilon _{0i}\) and \(Y_{i}(1)=\tau +\varepsilon _{1i}\), assuming \({{\,\textrm{E}\,}}(\varepsilon _{0i}) = {{\,\textrm{E}\,}}(\varepsilon _{1i})=0\) in the super population. Then, for a sample of size \(n=100\),

$$\begin{aligned} {{\,\textrm{SATE}\,}}(\textbf{u}_{s}^{100})=\tau +{\overline{\varepsilon }}_{1s}-{\overline{\varepsilon }}_{0s}, \end{aligned}$$

with \({\overline{\varepsilon }}_{ws} = \sum _{i=1}^{100}\varepsilon _{wi}/100\), \(w = 0, 1\), where \({\overline{\varepsilon }}_{0s} \ne {\overline{\varepsilon }}_{1s}\) for most samples. Averaging over 1000 simulation runs, Ding (2017) computes the p-values of Neyman’s test as

$$\begin{aligned} 2\Phi \left( -|{\widehat{\tau }}_{100}^{j}|/\sqrt{\widehat{{{\,\textrm{Var}\,}}({\widehat{\tau }}_{100}^j)}}\right) ,~~j=1,...,1000, \end{aligned}$$

and the p-values of the exact FRT are approximated as

$$\begin{aligned} {\widehat{\pi }}_{F}^{j}=\frac{1}{M}\sum _{m=1}^{M}\textbf{1}\left( |{\widehat{\tau }}(\textbf{W}^{m},\textbf{Y})|\ge |{\widehat{\tau }}_{100}^{j}|\right) ,~~j=1,...,1000, \end{aligned}$$
(16)

where \(M = 10^{5}\) and \(\textbf{1}(\cdot )\) is an indicator function. This, however, implies that the simulation set up in Ding (2017) is not for testing \(H_{0}^{n}(N)\). As alluded to above, this set up rather pertains to the power for testing \(H_{0}(N)\), although using only a single sample from this super population.

From Table 1 in Ding (2017), where the results of the two DGPs are reported, we note an overall power, for case (a), of 0.512 for Neyman’s test and 0.497 for Fisher’s test. The results are not only similar, but also close to the expected power under repeated sampling, i.e. \(1-\Phi (-0.04)+\Phi (-3.96) = 0.516\). For case (b), the power of Neyman’s test is 0.07 while that of Fisher’s test is 0.008, which is even lower than the nominal level. Both are, however, far from the expected power under repeated sampling which, in this case, is \(1-\Phi (0.6302)+\Phi (-3.2898) = 0.265\).
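The quoted repeated-sampling power values can be reproduced from the normal approximation with a two-sided 5% test (a sketch using only the standard library):

```python
from math import erf, sqrt

def phi(z):
    """Standard normal CDF."""
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))

def power(tau, v1, v0, n1, n0, z=1.96):
    """Two-sided power of the normal-approximation test."""
    se = sqrt(v1 / n1 + v0 / n0)
    return 1.0 - phi(z - tau / se) + phi(-z - tau / se)

# case (a): Y(1) ~ N(0.1, 1/16), Y(0) ~ N(0, 1/16), n1 = n0 = 50
pa = power(0.1, 1 / 16, 1 / 16, 50, 50)   # ~0.516
# case (b): Y(1) ~ N(0.1, 1/4), Y(0) ~ N(0, 1/16), n1 = 70, n0 = 30
pb = power(0.1, 1 / 4, 1 / 16, 70, 30)    # ~0.265
```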

The crux of the aforementioned comparison is that Ding (2017) does not consider all possible estimates \({\widehat{\tau }}_{100}^{j}\), \(j=1,...,A\), but only the estimates \({\widehat{\tau }}_{100}^{j}\), \(j=1,...,1000\), in a subset \(\textbf{W}_{B_{1}}\) of \(\textbf{W}\). The subset \(\textbf{W}_{B_{1}}\), with card(\(\textbf{W}_{B_{1}})=1000\), is only one set out of \(\genfrac(){0.0pt}1{A}{1000}\), with \(A=\genfrac(){0.0pt}1{100}{50}\) and \(\genfrac(){0.0pt}1{100}{70}\), respectively. Another random subset, say \(\textbf{W}_{B_{2}}\), would most likely give other results. The problem is less pertinent for Neyman inference, since the corresponding statistic does not depend on the empirical distribution over all allocations. However, as seen for DGP (b), the power can be quite far from expected even for Neyman’s test.

3.2 Exact FRT in allocation subsets

The power of a size \(\alpha \) FRT can be computed as

$$\begin{aligned} p_{F}=\frac{1}{A}\sum _{j=1}^{A}\textbf{1}(\pi _{F}^{j}\le \alpha ), \end{aligned}$$

where

$$\begin{aligned} \pi _{F}^{j}=\frac{1}{A}\sum _{m=1}^{A}\textbf{1}\left( |{\widehat{\tau }}(\textbf{W}^{m},\textbf{Y})|\ge |{\widehat{\tau }}_{n}^{j}|\right) ,~~j=1,...,A. \end{aligned}$$
(17)
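For a sample small enough to enumerate, the exact power \(p_F\) and the p-values in (17) can be computed by the full double loop over allocations (a sketch with hypothetical potential outcomes; `mde` is an illustrative helper):

```python
import numpy as np
from itertools import combinations

rng = np.random.default_rng(8)
n, n1 = 8, 4
Y1 = rng.normal(0.5, 0.25, n)  # potential outcomes under treatment
Y0 = rng.normal(0.0, 0.25, n)  # potential outcomes under control

alloc = list(combinations(range(n), n1))
A = len(alloc)                 # C(8, 4) = 70

def mde(treated, Yobs):
    mask = np.zeros(n, dtype=bool)
    mask[list(treated)] = True
    return Yobs[mask].mean() - Yobs[~mask].mean()

alpha = 0.05
rejections = 0
for j in alloc:                          # outer loop: realized allocation
    mask = np.zeros(n, dtype=bool)
    mask[list(j)] = True
    Yobs = np.where(mask, Y1, Y0)        # observed outcomes, Eq. (1)
    t_j = abs(mde(j, Yobs))
    # inner loop: exact FRT p-value over all allocations, Yobs held fixed
    pi_j = np.mean([abs(mde(m, Yobs)) >= t_j for m in alloc])
    rejections += pi_j <= alpha
p_exact = rejections / A                 # exact power over all A^2 pairs
```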

Ding (2017) selected a subset of 1000 allocations and calculated the power within this set, i.e.

$$\begin{aligned} p_{F|1000}=\frac{1}{1000}\sum _{j=1}^{1000}\textbf{1}({\widehat{\pi }}_{F}^{j}\le \alpha ). \end{aligned}$$

The power calculated over the subset \(\textbf{W}_{B_{1}}\) may not be a good approximation of the real power of the exact FRT, whether the complete set of allocations or the Monte Carlo approximation of the exact p-value is used. If \(10^{5}\) allocations provide good enough precision for the approximation of p-values, then the power should also be well approximated by the same set. This, however, requires \(10^{10}\) iterations in one cell, which is computationally infeasible. We will instead make use of the algorithm developed in Johansson and Schultzberg (2020).

In the re-randomization context, Johansson and Schultzberg (2020) suggest an alternative to Monte Carlo approximation of the p-value with large n. For the following discussion of their approach, it is instructive to reformulate the definition of the SATE (cf. Eqn. 3) as the average of all potential estimates in a sample, i.e.

$$\begin{aligned} {{\,\textrm{SATE}\,}}(\textbf{u}_{s}^{n})=\frac{1}{A}\sum _{j\in \textbf{W}}{\widehat{\tau }}_{n}^{j}. \end{aligned}$$
(18)

In (18), it follows from symmetry that the unbiasedness of the MDE stems from the fact that for any single allocation, e.g. \(\textbf{W}^{j}=(0,1,0,1,...,0,1)^{\prime }\), there exists a mirror allocation with 1’s and 0’s exchanged. This means that unbiasedness is preserved for any set of allocations \(\textbf{W}_{B_{k}}\), \(k=1,...,K\), with cardinality larger than two, as long as the set includes only pairs of mirror allocations. To emphasize that a set contains only mirror allocations, we add the superscript \(*\). For example, \(\textbf{W}_{B_{k}}^{*}\) is a set of allocations of cardinality \(B_{k}\), i.e. card(\(\textbf{W}_{B_{k}}^{*})=B_{k}\), containing \(B_{k}/2\) pairs of mirror allocations.

Here we simply take K random subsets, all of size \(B^{*}\), where \(B^{*}\) is small enough to conduct the exact test. The exact p-value of a two-sided hypothesis test for a given sample and subset of allocations is thus defined as

$$\begin{aligned} \pi _{F|B_{k}}=\Pr \left( |{\widehat{\tau }}(\textbf{W}_{B_{k}}^{*},\textbf{Y})|\ge |{\widehat{\tau }}_{n}^{j}(k,s)|\right) ,~~k=1,...,K, \end{aligned}$$

where \({\widehat{\tau }}_{n}^{j}(k,s)\) denotes an estimate for the kth subset in the sth sample. Even though the level of this test is always correct and the MDE is unbiased, it is important to note that the distributions of \({\widehat{\tau }}(\textbf{W}_{B_{k}}^{*},\textbf{Y})\) and \({\widehat{\tau }}(\textbf{W},\textbf{Y})\) generally differ, which explains why the power with \(\textbf{W}_{B_{k}}^{*}\) generally differs from the exact power. Computing power by averaging over the K subsets reduces the approximation error.
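The construction of a mirror-pair subset \(\textbf{W}_{B_{k}}^{*}\) can be sketched as follows (hypothetical sizes with \(n_1 = n_0\), so each allocation's mirror is itself a valid allocation; under the sharp null the MDE then averages to exactly zero over the subset):

```python
import numpy as np

rng = np.random.default_rng(9)
n, n1 = 100, 50
B = 1000                        # subset cardinality B*, must be even

# draw B/2 random allocations and add their mirrors (0s and 1s swapped)
half = np.zeros((B // 2, n), dtype=int)
for i in range(B // 2):
    half[i, rng.choice(n, n1, replace=False)] = 1
W_star = np.vstack([half, 1 - half])  # subset W*_{B_k} of mirror pairs

# under the sharp null the outcomes are fixed across allocations;
# each mirror pair contributes tau_hat and -tau_hat, which cancel
Y = rng.normal(0.0, 1.0, n)
taus = W_star @ Y / n1 - (1 - W_star) @ Y / (n - n1)
```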

4 Simulation study of exact FRT

To validate the discussion so far, we perform three simulation studies. The first re-examines the small simulation in Ding (2017). The second uses the same DGP but with \(n=12\). In the third, we study the performance of the two strategies in a pairwise stratified experiment. The two latter cases allow us to obtain the exact power of the exact FRT.

4.1 Case I: re-examining Ding (2017)’s study

We extend the DGPs in Ding (2017), keeping \(Y(0)\sim N(0,1/16)\) throughout where, for the alternative, we let (a) \(Y(1)\sim N(\tau ,1/16)\) with \(n_{1}=n_{0}=50\), (b) \(Y(1)\sim N(\tau , 1/4)\) with \(n_{1}=70\), \(n_{0}=30\), and (c) \(Y(1)\sim N(\tau ,1/4)\) with \(n_{1}=30\), \(n_{0}=70\). Settings (a) and (b) with \(\tau =0.10\) are as in Ding (2017), whereas setting (c) is considered as an extension thereof.

For power, we consider an increasing alternative with \(\tau = \{0, 0.05, 0.1, 0.15, 0.2, 0.25, 0.3\}\). As another extension, the same simulations are also conducted under homogeneous effects, where the outcome under treatment is generated as \(Y(1)=Y(0)+\tau \). As in Ding (2017), we test \(H_{0}(N):\tau =0\) at the 5% level, but we repeat the test over \(s=1,...,100\). Likewise, we let \(B^{*}=1000\), but calculate the power of the Neyman test and the exact test for each random subset \(\textbf{W}_{B_{k}}^{*}\), \(k=1,...,50\), of \(\textbf{W}\). That is, for the Neyman test we calculate the power as

$$\begin{aligned} p_{N}(k,s,\tau )=\frac{1}{1000}\sum _{j\in B_{k}}\textbf{1}(\pi _{N}^{j}(s,\tau )\le 0.05), \end{aligned}$$

with \(k = \{1,...,50\}\), \(s = \{1,...,100\}\) and \(\tau \) as given above, where each p-value, \(\pi _{N}^{j}(s,\tau )\), is the standard p-value from a two-sample t-test with the Satterthwaite approximation of the degrees of freedom (Welch 1947). The power of Fisher’s test is computed as

$$\begin{aligned} p_{F|B_{k}}(k,s,\tau )=\frac{1}{1000}\sum _{j\in B_{k}}\textbf{1}(\pi _{F|B_{k}}^{j}(k,s,\tau )\le 0.05), \end{aligned}$$
(19)

with \(k, s, \tau \) as above, where, for \(j=1,...,1000\),

$$\begin{aligned} \pi _{F|B_{k}}^{j}(k,s,\tau )=\frac{1}{1000}\sum _{m\in B_{k}^{*}}\textbf{1}(|{\widehat{\tau }}(\textbf{W}_{B_{k}}^{*m},\textbf{Y})|\ge |{\widehat{\tau }}_{n}^{j}(k,s)|). \end{aligned}$$

The number of replicates in a complete MC simulation would be \(1.0089\times 10^{29}\) \((=\genfrac(){0.0pt}1{100}{50})\) for DGPs (a) and (b), and \(2.9372\times 10^{25}\) \((=\genfrac(){0.0pt}1{100}{70}=\genfrac(){0.0pt}1{100}{30})\) for (c). Here the number of replicates is \(50\times 1000=50{,}000\) for each cell when averaging over the subsets.

The results for all DGPs and effect types (homogeneous vs. heterogeneous) are summarized using a four-way analysis of variance (ANOVA), with factors (i) Inference (Fisher and Neyman), (ii) Effect size (seven levels), (iii) Subset (50 levels), and (iv) Sample (100 levels). We restrict the analysis to second-order interactions. The upper panel of Table 1 reports results for homogeneous effects, with DGPs (a), (b) and (c) in columns 3–4, 5–6 and 7–8, respectively; the lower panel reports the same setup for the heterogeneous case.

The coefficient of determination, \(R^{2}\), exceeds 99% for all layouts, an encouraging indicator for summarizing the results by ANOVA. The first take-home message from Table 1 is that effect size is the most important factor, as expected. Secondly, the sample factor contributes quite substantially to explaining the variance, both as a main effect and in interaction with effect size. Again as expected, this holds in particular for the heterogeneous-effects case. Thirdly, differences in inference contribute only to a small extent to the variation. However, as this factor has just one degree of freedom (DF), the F-statistic is quite high, though not close to the F-statistic for the sample (including interaction effects). The final conclusion pertains to the Subset factor: it does not add much to the explanation of the variance, which means the approximation errors incurred by using subsets in the simulations are small.

Table 1 ANOVA results for the Monte Carlo simulation with \(n=100\)

Figure 1 depicts the simulation results across \(\tau \), with heterogeneous and homogeneous effects displayed in the left and right panels, and DGPs (a), (b) and (c) in the upper, middle and lower panels, respectively. The box plots display inference to the experiment across 100 random samples, i.e. the fraction of rejected tests across \(\tau \) in each of the 100 experiments. With a homogeneous effect, the fraction of rejected tests provides the size when \(\tau =0\) and the power when \(\tau > 0\). Inference to the population is obtained by averaging the fraction of rejections over the 100 random samples, as displayed by the power curves.

Fig. 1

Power comparison of Fisher and Neyman test for 100 independent samples in a complete randomized experiment with \(n=100\), where \(n_1 = 50\) and \(\sigma _{Y(0)}^{2}=\sigma _{Y(1)}^{2}=1/16\) in the top panel, \(n_1 = 70\) and \(\sigma _{Y(1)}^{2}=1/4\) in the middle panel, \(n_1 = 30\) and \(\sigma _{Y(1)}^{2}=1/4\) in the bottom panel. Solid line is the power averaged over sets and samples

For homogeneous effects, we observe similar results with respect to inference to the experiment and to the population. With respect to inference to the experiment, the Fisher test has by design the correct size, while there is a very small divergence for the Neyman test. Within samples there is a small divergence in power; the maximum divergence between the two tests is 12%, at \(\tau =0.15\) in panel (b). This suggests that, for homogeneous effects, the conclusion from a single experiment does not generally depend on the type of test conducted. As seen from the power curves, the two tests are indistinguishable for inference to the population.

For heterogeneous effects, we observe substantial variation in the fraction of rejections within samples. As expected, a substantial number of experiments reject the hypothesis \(\tau =0\). The pattern is similar between the two tests in the balanced, equal-variance case (a); the maximum divergence in power between the two tests is 12.5%, at \(\tau =0.15\). However, with unbalanced designs and unequal variances, substantial differences between the tests are seen. With \(n_{1}=70\) (panel (b)) there are more cases rejecting \(\tau =0\) for the Neyman test, while the reverse holds when \(n_1=30\) (panel (c)). Thus, with heterogeneous effects in an unbalanced design, the conclusion from Fisher’s test in a single experiment may well differ from that of the t-test. These differences in inference to the experiment are summarized in the power curves in the figure. For (a), the size and power over random sampling are the same for both strategies, which confirms the result in Lehmann (1959, §5.8). For (b), the average size of the t-test exceeds 5% and, neglecting the size distortion, the t-test is also more powerful than the exact Fisher test. For (c), the roles of the two tests are reversed.

The last thing to note from the figure is that there is no sign of the diverging difference in power with the effect size suggested by Eqn. (15), neither for inference in the single experiment nor for inference over random sampling.

Note that, since we approach the numerical assessment by conducting exact Fisher inference within each subset and then averaging over all subsets, its usefulness can be gauged from the fact that the results in Fig. 1 seem to validate the theoretical results in Lehmann (1959). In the next two sub-sections, we consider the performance in small experiments, where simulations are conducted within the complete set of allocations.

4.2 Small sample version

We use the same DGPs as above, again with 100 random draws from the super population, now with \(n=12\) and \(\tau = \{0, 0.25, 0.5, 0.75, 1, 1.25\}\), where \(n_{1}\) is 6, 8 and 4 for DGPs (a), (b) and (c), respectively. Figure 2 presents the results from the simulations across \(\tau \).

The box plots and solid lines display, respectively, the per-sample rejection fractions and the size and power averaged over the 100 samples. The results for heterogeneous and homogeneous effects are displayed in the left and right panels, respectively, with DGPs (a)–(c) in the top, middle and bottom panels. The overall picture is the same as with \(n=100\). There is substantial variation in power for heterogeneous effects, but the power curves for the two strategies overlap. There is also no sign of a diverging difference in power with the effect size as suggested by Eqn. (15). For homogeneous effects, there is some size distortion for the t-test, and the exact Fisher test is somewhat more powerful.
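The complete-set simulations replace the random subset \(B_{k}^{*}\) by full enumeration. A minimal sketch of the exact p-value (the helper name is hypothetical, not the authors’ code), here over all 924 allocations of 6 treated units among 12:

```python
from itertools import combinations
import numpy as np

def exact_fisher_p(w, y):
    """Exact randomization p-value for the sharp null: enumerate every
    allocation of n1 treated units among n (feasible for small n)."""
    n, n1 = len(w), int(w.sum())
    tau_obs = abs(y[w == 1].mean() - y[w == 0].mean())
    extreme, total = 0, 0
    for treated in combinations(range(n), n1):
        w_star = np.zeros(n, dtype=int)
        w_star[list(treated)] = 1
        tau_star = y[w_star == 1].mean() - y[w_star == 0].mean()
        extreme += abs(tau_star) >= tau_obs
        total += 1
    return extreme / total, total

rng = np.random.default_rng(1)
w = np.array([1] * 6 + [0] * 6)                  # n = 12, n1 = 6
y = np.where(w == 1, rng.normal(0.25, 0.25, 12), rng.normal(0.0, 0.25, 12))
p, n_alloc = exact_fisher_p(w, y)
```

Since the observed allocation is itself one of the 924 enumerated allocations, the exact p-value is always strictly positive.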

Fig. 2

Difference in power between Fisher and Neyman inference over \(\genfrac(){0.0pt}1{12}{6}=924\), \(\genfrac(){0.0pt}1{12}{8}=495\) and \(\genfrac(){0.0pt}1{12}{4}=495\) possible allocations in 100 independent samples, respectively, in top, middle and lower panels. The solid line is the power averaged over the 100 samples

4.2.1 Pairwise stratification

Now we consider a pairwise stratified experiment. Let \(Y_{ij}(w)\), \(w=0,1\), \(i=1,...,n\), \(j=1,2\), be potential outcomes of units in a matched-pair experiment with \(2n\) units (\(n\) pairs). The within-pair estimator

$$\begin{aligned} {\widehat{\tau }}_{i}=W_{i}(Y_{i1}-Y_{i2})+(1-W_{i})(Y_{i2}-Y_{i1}) \end{aligned}$$

is unbiased for the within-pair average causal effect \(\tau _{i}\), where

$$\begin{aligned} {\widehat{\tau }}_{n}(s)=\frac{1}{n}\sum _{i=1}^{n}{\widehat{\tau }}_{i} \end{aligned}$$

is an unbiased estimator of the sample treatment effect

$$\begin{aligned} {{\,\textrm{SATE}\,}}(\textbf{u}_{s}^{n})=\frac{1}{n}\sum _{i=1}^{n} \tau _{i}. \end{aligned}$$
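In code, the pair-level and sample-level estimators above amount to the following sketch (the \((n,2)\) layout of observed outcomes and the function name are assumptions for illustration):

```python
import numpy as np

def pair_estimates(W, Y):
    """Within-pair estimates tau_hat_i and their mean tau_hat_n(s).
    W[i] = 1 if unit j=1 of pair i is treated; Y has shape (n, 2)
    holding the observed outcomes (Y_i1, Y_i2) of pair i."""
    tau_i = np.where(W == 1, Y[:, 0] - Y[:, 1], Y[:, 1] - Y[:, 0])
    return tau_i, tau_i.mean()

# Two pairs: unit 1 treated in the first pair, unit 2 in the second.
tau_i, tau_n = pair_estimates(np.array([1, 0]),
                              np.array([[3.0, 1.0], [2.0, 5.0]]))
```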

Following Ding (2017), the conservative variance estimator of \({\widehat{\tau }}_{n}(s)\) is

$$\begin{aligned} {{\,\textrm{Var}\,}}(N)=\frac{1}{n(n-1)}\sum _{i=1}^{n}({\widehat{\tau }}_{i}-{\widehat{\tau }}_{n}(s))^{2}, \end{aligned}$$

where Ding’s Theorem 4 shows that

$$\begin{aligned} {{\,\textrm{Var}\,}}({\widehat{\tau }}(\textbf{W},\textbf{Y}|H_{0}^{F})) = \frac{1}{n^2}\sum _{i=1}^{n}{\widehat{\tau }}_{i}^{2}, \end{aligned}$$

so that

$$\begin{aligned} {{\,\textrm{Var}\,}}({\widehat{\tau }}(\textbf{W},\textbf{Y}|H_{0}^{F})) - {{\,\textrm{Var}\,}}(N) = \frac{1}{n}{{\,\textrm{SATE}\,}}(\textbf{u}_{s}^{n})^{2}+o_{p}(n^{-1}). \end{aligned}$$
(20)
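The two variance formulas and the gap in Eqn. (20) are easy to check numerically; a sketch with an arbitrary vector of pair estimates:

```python
import numpy as np

def variance_gap(tau_i):
    """Neyman's conservative variance estimator, the randomization
    variance under the sharp null, and the leading term SATE^2/n of
    their difference in Eqn. (20)."""
    n = len(tau_i)
    tau_n = tau_i.mean()
    var_neyman = ((tau_i - tau_n) ** 2).sum() / (n * (n - 1))
    var_fisher = (tau_i ** 2).sum() / n ** 2
    return var_fisher - var_neyman, tau_n ** 2 / n

gap, leading_term = variance_gap(np.array([1.0, 2.0, 3.0, 4.0]))
```

Here the gap is \(35/24\approx 1.458\) against a leading term of \(25/16=1.5625\); the small difference is the \(o_{p}(n^{-1})\) remainder.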

We draw 100 independent samples of size \(n=30\) (i.e. 15 pairs). For each sample, the p-values for all \(2^{n/2}\) = 32,768 possible allocations under the paired design are calculated. We again assume

$$\begin{aligned} Y(0)\sim N(0,1/16), \end{aligned}$$
(21)

and generate the counterfactual as

$$\begin{aligned} \text {(a)}&\text {:}&\text { }Y(1)\sim N(\tau ,1/16) \end{aligned}$$
(22)
$$\begin{aligned} \text {(b)}&\text {:}&\text { }Y(1)=Y(0)+\tau , \end{aligned}$$
(23)

where \(\tau = \{0, 0.05, 0.15, 0.25, 0.35, 0.45\}\). The size and power for each sample are computed as the proportion of the corresponding p-values below \(\alpha =0.05\). The results for heterogeneous effects under DGP (a) are shown in the left panel of Fig. 3 and those for homogeneous effects under DGP (b) in the right panel. The most important finding is that the power curves of the two tests overlap, so that we cannot discover any divergence in the power of the t-test compared with Fisher’s test, as expected from Eqn. (20).
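Under the sharp null, the treatment label within each pair can be flipped independently, so the exact test enumerates all \(2^{n/2}\) sign patterns of the pair estimates. A minimal sketch (hypothetical helper; the pair estimates are drawn here merely to mimic DGP (a)):

```python
from itertools import product
import numpy as np

def paired_exact_p(tau_i):
    """Exact randomization p-value in a matched-pair experiment: under
    the sharp null, each within-pair sign flip is equally likely."""
    tau_obs = abs(tau_i.mean())
    extreme, total = 0, 0
    for signs in product((-1.0, 1.0), repeat=len(tau_i)):
        extreme += abs((np.asarray(signs) * tau_i).mean()) >= tau_obs
        total += 1
    return extreme / total, total

rng = np.random.default_rng(2)
tau_i = rng.normal(0.15, 0.25, size=15)   # 15 pairs, 2**15 allocations
p, n_alloc = paired_exact_p(tau_i)
```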

Fig. 3

Difference in power between Fisher and Neyman inference over all \( 2^{15}=32,768 \) possible allocations under pairwise stratified randomization in 100 independent samples with \(n=30\). The solid line is the power averaged over samples

5 Discussion

Ding (2017) gives interesting theoretical results on the comparison of Neyman’s and Fisher’s two-sample inference based on the theory of potential outcomes. The present paper examines to what extent the results in Ding (2017) apply to the exact Fisher test for inference to the sample under the alternative. Based on the same data generating processes as in Ding (2017), we conduct a Monte Carlo study that captures the finite, but large, sample power properties of the exact Fisher randomization test. The results show no overall superiority of the Neyman test over the exact Fisher randomization test for any effect size. Instead, which test is most powerful in this setting seems to depend on the characteristics of the outcomes in the given sample.

For heterogeneous treatment effects, both tests in general have the wrong size when testing the population (or scientific) null in a single experiment, which illustrates Fisher’s concern (Fisher 1955). The crux in the single-experiment case is that the sample average treatment effect (SATE) depends on the units sampled into the experiment and, with N fixed, the SATE in general differs from zero. This fact perhaps pertains to what Fisher (1955, p. 69) meant by the statement “we consider a continuum of hypotheses, each eligible as null hypothesis”. The within-sample asymptotic theory solves the problem by assuming the sample to be infinite, but most experiments are conducted on a finite sample, whence the Neyman and Fisher tests have the same problem testing the scientific null.

It is, however, interesting to note that when testing a sample (or statistical) null, the two tests may give different conclusions, at least in unbalanced designs and with unequal variances. This gives some food for thought for theoretical research since, at the end of the day, all we as statisticians have is the result from a single experiment.