1 Introduction

Suppose we observe data \(X_{1},\dots ,X_{n}\in \mathcal {X}\), and would like to test the null hypothesis

$$ H_{0}: X_{1},\dots,X_{n}\ \text{are exchangeable}. $$
(1.1)

(Note that the hypothesis that the Xi’s are i.i.d. is a special case of this null.) We assume that we have a pre-specified test statistic, which is a function \(T: \mathcal {X}^{n} \rightarrow \mathbbm {R}\), where, without loss of generality, we let larger values of \(T(X) = T(X_{1},\dots ,X_{n})\) indicate evidence in favor of an alternative hypothesis.

Since the null distribution of the Xi’s is not specified exactly, we usually do not know the null distribution of T(X). The permutation test avoids this difficulty by comparing T(X) against the same function applied to permutations of the data. To elaborate, let \(\mathcal {S}_{n}\) denote the set of all permutations on \([n]:=\{1,\dots ,n\}\), and define

$$ x_{\sigma} := (x_{\sigma(1)},\dots,x_{\sigma(n)}) $$

for any \(x\in \mathcal {X}^{n}\) and any \(\sigma \in \mathcal {S}_{n}\). Then, we can compute a p-value

$$ P = \frac{1}{n!}\sum\limits_{\sigma\in\mathcal{S}_{n}} \mathbbm{1}\left\{T(X_{\sigma})\geq T(X)\right\}, $$
(1.2)

which ranks T(X) amongst \(\{T(X_{\sigma })\}_{\sigma \in \mathcal {S}_{n}}\) sorted in decreasing order. Then, under the null hypothesis H0, P is a valid p-value, meaning \(\mathbb {P}_{H_{0}}\left \{P\leq \alpha \right \}\leq \alpha \) for all α ∈ [0,1].
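
For concreteness, the following minimal Python sketch (ours, purely illustrative; the data and statistic are placeholders) evaluates the p-value in Eq. 1.2 by brute-force enumeration, which is feasible only for very small n.

```python
import itertools
import math

import numpy as np


def full_permutation_pvalue(X, T):
    """Eq. 1.2: the fraction of all n! permutations sigma with T(X_sigma) >= T(X)."""
    X = np.asarray(X)
    n = len(X)
    t_obs = T(X)
    count = sum(T(X[np.array(sigma)]) >= t_obs
                for sigma in itertools.permutations(range(n)))
    return count / math.factorial(n)


# Toy usage: a statistic that is large when the first half of the sample is large.
rng = np.random.default_rng(0)
X = rng.normal(size=6)
print(full_permutation_pvalue(X, lambda x: x[:3].sum()))
```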

As an example, suppose that the observed data set actually consists of pairs (Xi,Yi), which are assumed to be i.i.d. from some joint distribution. If we are interested in testing the hypothesis \(H_{0}^{\prime }\) that \(X_{i}\) and \(Y_{i}\) are independent, we can reframe this question as testing whether \(X_{1},\dots ,X_{n}\) are i.i.d. conditional on \(Y_{1},\dots ,Y_{n}\)—in particular, under \(H_{0}^{\prime }\), it holds that X follows an exchangeable distribution conditional on Y. Our test statistic T might be chosen as

$$T(X) = \left|\text{Corr}\big((X_{1}, \dots,X_{n}),(Y_{1},\dots,Y_{n})\big)\right|.$$

In order to see whether the observed correlation is sufficiently large to be statistically significant, we would compare T(X) to the correlations computed on permuted data,

$$T(X_{\sigma})=\left|\text{Corr}\big((X_{\sigma(1)},\dots,X_{\sigma(n)}),(Y_{1},\dots,Y_{n})\big)\right|.$$

The resulting p-value computed as in Eq. 1.2 is then a valid p-value under the null hypothesis \(H_{0}^{\prime }\). In addition to testing independence, permutation tests are also commonly used for testing other hypotheses, such as whether two samples follow the same distribution.
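
Continuing the independence example, here is a small sketch (ours; the simulated data, sample size, and seed are arbitrary) of the correlation-based permutation test, again by full enumeration over a small sample.

```python
import itertools
import math

import numpy as np

rng = np.random.default_rng(1)
n = 7
Y = rng.normal(size=n)
X = 0.5 * Y + rng.normal(size=n)        # X correlated with Y, so H0' is false


def T(x):
    """|Corr| between the (permuted) X and the fixed Y."""
    return abs(np.corrcoef(x, Y)[0, 1])


t_obs = T(X)
count = sum(T(X[np.array(sigma)]) >= t_obs
            for sigma in itertools.permutations(range(n)))
print("permutation p-value:", count / math.factorial(n))
```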

The p-value P computed in Eq. 1.2 requires computing T(Xσ) for every \(\sigma \in \mathcal {S}_{n}\). One may naturally be interested in reducing the computational cost of this procedure, since computing T(Xσ) for all \(|\mathcal {S}_{n}|=n!\) permutations may be prohibitive for even moderately large n. As is well known, we can obtain valid p-values by uniformly randomly sampling permutations from \(\mathcal {S}_{n}\) and computing

$$ P = \frac{1+{\sum}_{m=1}^{M} \mathbbm{1}\left\{T(X_{\sigma_{m}})\geq T(X)\right\}}{M+1}, $$
(1.3)

in which σ1,…,σM are i.i.d. uniform draws from \(\mathcal {S}_{n}\).
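
The Monte Carlo version in Eq. 1.3 can be sketched as follows (an illustrative implementation of ours; the difference-in-means statistic in the usage line is a toy choice).

```python
import numpy as np


def sampled_permutation_pvalue(X, T, M=999, rng=None):
    """Eq. 1.3: P = (1 + #{m : T(X_{sigma_m}) >= T(X)}) / (M + 1),
    with sigma_1, ..., sigma_M drawn i.i.d. and uniformly from S_n."""
    rng = np.random.default_rng() if rng is None else rng
    X = np.asarray(X)
    t_obs = T(X)
    count = sum(T(rng.permutation(X)) >= t_obs for _ in range(M))
    return (1 + count) / (M + 1)


# Toy usage on a sample of size 50.
X = np.random.default_rng(2).normal(size=50)
print(sampled_permutation_pvalue(X, lambda x: x[:25].mean() - x[25:].mean()))
```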

In a different direction, one can also restrict the set of permutations to a subset of \(\mathcal {S}_{n}\). Specifically, let \(G\subseteq \mathcal {S}_{n}\) be any subset, and define

$$ P = \frac{1}{|G|}\sum\limits_{\sigma\in G} \mathbbm{1}\left\{T(X_{\sigma})\geq T(X)\right\}, $$
(1.4)

where |G| is the cardinality of G.

Clearly, if G does not contain the identity permutation, then P cannot be a p-value because it could potentially take on the value zero. However, including the identity permutation is not sufficient. The literature repeatedly emphasizes that P defined in Eq. 1.4 is a valid p-value only if the subset \(G\subseteq \mathcal {S}_{n}\) is in fact a subgroup (Hemerik and Goeman, 2018, Theorem 1).

The subgroup G mentioned above may be chosen strategically to balance between computational efficiency and the power of the test (see, e.g., Hemerik and Goeman (2018) and Koning and Hemerik (2022)). In case G has a large cardinality, the aforementioned references show that sampling permutations uniformly at random from G also yields valid p-values—that is, the randomized p-value P from Eq. 1.3 is valid if \(\sigma _{1},\dots ,\sigma _{M}\) are i.i.d. samples drawn uniformly from G rather than from \(\mathcal {S}_{n}\). Again, choosing G to be a subgroup (rather than an arbitrary subset), and sampling uniformly rather than from an arbitrary distribution, are both important for the validity of the resulting p-value.

1.1 Contributions

The background above naturally leads to the following question: while it is indeed correct that P from Eq. 1.4 is not, in general, a valid p-value for subsets \(G \subseteq \mathcal {S}_{n}\) that are not subgroups, is it possible to slightly modify the definition of P so that it retains its validity for such subsets? Further, while sampling the σm’s nonuniformly from G would destroy the validity of P from Eq. 1.3, can we modify the definition of P so that nonuniform sampling from a set is allowed? The first question is addressed by Hemerik and Goeman (2018), as we will describe below; to our knowledge, the second question has not been addressed in the literature.

In this paper, we will broaden the applicability of permutation tests and present generalizations, which yield valid p-values even when we sample permutations—with or without replacement—from a non-uniform distribution over all permutations or from an arbitrary subset of permutations. In doing so, we shall carefully explain how this generalization relates to and extends all previous options. This generalization yields new and more flexible permutation test methods; we leave a detailed study of the pros and cons of these generalizations (such as how they trade off the two types of errors) to future work.

2 A Generalized Permutation Test

2.1 Testing with an Arbitrary Distribution

We now present our first generalization of the permutation test. It allows us to use any (not necessarily uniform) distribution over \(\mathcal {S}_{n}\) in order to construct our permutation p-value.

Theorem 2.1.

Let q be any distribution over \(\sigma \in \mathcal {S}_{n}\). Let \(\sigma _{0}\sim q\) be a random draw, and define

$$ P = \sum\limits_{\sigma\in\mathcal{S}_{n}} q(\sigma)\, \mathbbm{1}\left\{T(X_{\sigma\circ\sigma_{0}^{-1}})\geq T(X)\right\}. $$
(2.1)

Then P is a valid p-value, i.e., \(\mathbb {P}_{H_{0}}\left \{P\leq \alpha \right \}\leq \alpha \) for all α ∈ [0,1].

In this theorem, validity is retained when conditioning on the order statistics of the data, meaning that \(\mathbb {P}_{H_{0}}\left \{P\leq \alpha | X_{(1)},\dots ,X_{(n)}\right \}\leq \alpha \), where \(X_{(1)}\leq {\dots } \leq X_{(n)}\) are the order statistics of \(X=(X_{1},\dots ,X_{n})\). The reason that this holds is simply because H0 remains true even conditional on the order statistics—that is, if X is exchangeable, then \(X\mid (X_{(1)},\dots ,X_{(n)})\) is again exchangeable. The same conditional validity holds for all results to follow. However, one cannot condition on σ0; the result only holds marginally over σ0, and this external randomization is key to retaining validity.

We defer the proof to Section 2.3—we will first discuss connections to the existing literature in order to provide more context and intuition for the above theorem.
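
To make Theorem 2.1 concrete, the following sketch (ours, not part of the original development) evaluates Eq. 2.1 when q is supported on a small, explicitly listed set of permutations. A permutation σ is stored as a 0-indexed array `sigma` with \(x_{\sigma}\) given by `x[sigma]`, so that \(\sigma_{0}^{-1}\) is `np.argsort(sigma0)` and \(\sigma\circ\sigma_{0}^{-1}\) is `sigma[inv0]`.

```python
import numpy as np


def generalized_pvalue(X, T, perms, weights, rng=None):
    """Eq. 2.1 (sketch): draw sigma_0 ~ q, then
    P = sum_sigma q(sigma) * 1{T(X_{sigma o sigma_0^{-1}}) >= T(X)}.
    `perms` lists the permutations supporting q; `weights` are their probabilities."""
    rng = np.random.default_rng() if rng is None else rng
    X = np.asarray(X)
    perms = [np.asarray(s) for s in perms]
    weights = np.asarray(weights, dtype=float)
    sigma0 = perms[rng.choice(len(perms), p=weights)]   # sigma_0 ~ q
    inv0 = np.argsort(sigma0)                           # sigma_0^{-1}
    t_obs = T(X)
    return float(sum(w * (T(X[s[inv0]]) >= t_obs) for w, s in zip(weights, perms)))


# Toy usage: a nonuniform q supported on three permutations of n = 4 points.
perms = [np.array([0, 1, 2, 3]), np.array([2, 3, 0, 1]), np.array([1, 0, 3, 2])]
X = np.random.default_rng(3).normal(size=4)
print(generalized_pvalue(X, lambda x: x[0] + x[1], perms, [0.5, 0.25, 0.25]))
```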

Uniform Distribution Over a Subgroup

To begin with, assume q is the uniform distribution over a fixed subgroup G of \(\mathcal {S}_{n}\). Then in this case, the p-value in Eq. 2.1 takes the special form

$$ P = \frac{1}{|G|}\sum\limits_{\sigma\in G} \mathbbm{1}\left\{T(X_{\sigma\circ\sigma_{0}^{-1}})\geq T(X)\right\} = \frac{1}{|G|}\sum\limits_{\sigma\in G} \mathbbm{1}\left\{T(X_{\sigma})\geq T(X)\right\}, $$

where the second equality holds because a subgroup G is closed under inverses and composition, so \(\{\sigma \circ \sigma _{0}^{-1} : \sigma \in G\} = G\) for any σ0 ∈ G. This simple observation recovers a well-known fact we discussed earlier; namely, one can restrict the set of permutations to an arbitrary subgroup, and the p-value P defined in Theorem 2.1 will then coincide with our earlier definition Eq. 1.4 (proved to be a valid p-value in (Hemerik and Goeman, 2018, Theorem 1)).

Uniform Distribution Over a Subset

Consider now a uniform distribution over an arbitrary subset S that is not a subgroup. In this case, the definition of P in Theorem 2.1 is equal to

$$ P = \frac{1}{|S|}\sum\limits_{\sigma\in S} \mathbbm{1}\left\{T(X_{\sigma\circ\sigma_{0}^{-1}})\geq T(X)\right\}, $$
(2.2)

as proposed earlier by Hemerik and Goeman (2018). This is, in general, not the same as

$$ P^{\prime} = \frac{1}{|S|}\sum\limits_{\sigma\in S} \mathbbm{1}\left\{T(X_{\sigma})\geq T(X)\right\} $$
(2.3)

(which is equivalent to the quantity defined in Eq. 1.4 earlier, with the subset S in place of a subgroup G). As we shall see below, \(P^{\prime }\) is generally not a valid p-value, a fact which can cause serious issues in practice, as has been emphasized in the literature. For example, consider the tool of balanced permutations: in the setting of testing whether a randomly assigned treatment has a zero or nonzero effect, this method has been proposed as a variant of the permutation test, where the subset S consists of all permutations such that the permuted treatment group contains exactly half of the original treatment group, and half of the original control group. Southworth et al. (2009) show that the quantity \(P^{\prime }\) computed in Eq. 2.3 for this choice of subset S can be substantially anti-conservative, i.e., \(\mathbb {P}\left \{P^{\prime }\leq \alpha \right \}> \alpha \), particularly for low significance levels α. (See also Hemerik and Goeman (2018) for additional discussion of this issue.)

A simple example may help to illustrate this point, and to give intuition for the role of the random permutation σ0.

Example 2.1.

Let n = 4, and consider the set

$$ S = \{\text{Id} , \sigma_{1\leftrightarrow 3, 2\leftrightarrow 4}, \sigma_{1\leftrightarrow 4, 2\leftrightarrow 3}\},$$

where, e.g., \(\sigma _{1\leftrightarrow 3, 2\leftrightarrow 4}\) is the permutation swapping entries 1 and 3 and also swapping 2 and 4. Let \(X_{1},X_{2},X_{3},X_{4}\overset {iid}{\sim } \mathcal {N}(0,1)\) be standard normal random variables (so that the null hypothesis of exchangeability, H0, is satisfied), and set T(X) = X1 + X2. Then the quantity \(P^{\prime }\) defined in Eq. 2.3 is equal to

$$ P^{\prime} = \frac{\mathbbm{1}\left\{X_{1}+X_{2}\geq X_{1}+X_{2}\right\} + \mathbbm{1}\left\{X_{3}+X_{4}\geq X_{1}+X_{2}\right\} + \mathbbm{1}\left\{X_{4}+X_{3}\geq X_{1}+X_{2}\right\}}{3}. $$

This gives

$$ P^{\prime} = \begin{cases}\frac{1+0+0}{3} = \frac{1}{3}, & \text{ if }X_{3} + X_{4} < X_{1} + X_{2} , \\ \frac{1+1+1}{3}= 1, & \text{ otherwise}\end{cases} $$

and, therefore,

$$P^{\prime} = \begin{cases} \frac{1}{3},& \text{with probability } \frac{1}{2},\\ 1,& \text{with probability } \frac{1}{2}.\end{cases}$$

We can thus see that \(P^{\prime }\) is anti-conservative at the threshold \(\alpha = \frac {1}{3}\).

Next, we will see how the correction in Eq. 2.2 fixes the failure described above. Denote by Pσ the p-value in Eq. 2.2 calculated conditional on the random σ0 being equal to σ, so that

$$ P = \begin{cases} P_{\text{Id}}, & \text{ w.p. } 1/3, \\ P_{\sigma_{1\leftrightarrow 4, 2\leftrightarrow 3}}, & \text{ w.p. } 1/3, \\ P_{\sigma_{1\leftrightarrow 3, 2\leftrightarrow 4}}, & \text{ w.p. } 1/3. \end{cases} $$
(2.4)

The calculation performed above for \(P^{\prime }\) then shows that

$$ P_{\text{Id}} = \begin{cases} \frac{1}{3}, & \text{ if }X_{3} + X_{4} < X_{1} + X_{2} , \\ 1, & \text{ otherwise}.\end{cases} $$

A similar straightforward calculation then yields

$$ P_{\sigma_{1\leftrightarrow 3, 2\leftrightarrow 4}} = P_{\sigma_{1\leftrightarrow 4, 2\leftrightarrow 3}} = \begin{cases} \frac{2}{3}, & \text{ if }X_{3} + X_{4} < X_{1} + X_{2} , \\ 1, & \text{ otherwise}.\end{cases} $$

Put together, we obtain

$$ P = \begin{cases} \frac{1}{3}, & \text{ w.p. } 1/6, \\ \frac{2}{3}, & \text{ w.p. } 1/3, \\ 1, & \text{ w.p. } 1/2. \end{cases} $$
(2.5)

This is indeed stochastically larger than uniform, and is thus a valid p-value.
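
A quick simulation sketch (ours) reproduces these numbers, estimating the null probabilities that the uncorrected \(P^{\prime}\) of Eq. 2.3 and the corrected P of Eq. 2.2 fall at or below 1/3.

```python
import numpy as np

rng = np.random.default_rng(0)
S = [np.array([0, 1, 2, 3]),     # Id
     np.array([2, 3, 0, 1]),     # swap 1<->3 and 2<->4
     np.array([3, 2, 1, 0])]     # swap 1<->4 and 2<->3
T = lambda x: x[0] + x[1]
reps = 20000
p_prime = np.empty(reps)
p_corrected = np.empty(reps)
for r in range(reps):
    X = rng.normal(size=4)                                   # H0 holds: X is exchangeable
    t_obs = T(X)
    p_prime[r] = np.mean([T(X[s]) >= t_obs for s in S])      # Eq. 2.3 (no sigma_0)
    sigma0 = S[rng.integers(len(S))]                         # sigma_0 ~ Unif(S)
    inv0 = np.argsort(sigma0)
    p_corrected[r] = np.mean([T(X[s[inv0]]) >= t_obs for s in S])   # Eq. 2.2
print("P{P' <= 1/3}:", np.mean(p_prime <= 1 / 3))        # approx 1/2 (anti-conservative)
print("P{P  <= 1/3}:", np.mean(p_corrected <= 1 / 3))    # approx 1/6 (valid)
```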

The role of \(\sigma_{0}\)

To better understand the role of the random permutation σ0, let us consider Example 2.1 again, and look more closely at what goes wrong there. We observe that \(P^{\prime }\) compares the observed statistic T(X) against the set \( \{T(X_{\sigma })\}_{\sigma \in S} = \{T(X_{\text {Id}}),T(X_{\sigma _{1\leftrightarrow 3, 2\leftrightarrow 4}}) , T(X_{\sigma _{1\leftrightarrow 4, 2\leftrightarrow 3}})\}\). For \(P^{\prime }\) to be a valid p-value, given the (unordered) set of potential data vectors \(\{X_{\text {Id}},X_{\sigma _{1\leftrightarrow 3, 2\leftrightarrow 4}} , X_{\sigma _{1\leftrightarrow 4, 2\leftrightarrow 3}}\}\), it suffices that the actual observed data X is equally likely to be any one of these three. Now suppose this set is equal to {(0.8,0.5,0.2,1.0),(1.0,0.2,0.5,0.8),(0.5,0.8,1.0,0.2)}, in no particular order. Each of these three vectors has equal likelihood under H0 (due to exchangeability). Counterintuitively, however, our knowledge of the subset of permutations S implies that we must have X = (1.0,0.2,0.5,0.8)—otherwise we could not have obtained this particular set. For instance, if X = (0.8,0.5,0.2,1.0), then we would have \(X_{\sigma _{1\leftrightarrow 3, 2\leftrightarrow 4}} = (0.2, 1.0, 0.8, 0.5)\)—but this does not lie in our set, so it cannot be the correct value of X. In other words, if we condition on the unordered set {Xσ}σS, which is the orbit of the data X under the actions of permutations σS, our intuition tells us that X is equally likely to be any element of this orbit—but in fact, for a non-subgroup S, X might be uniquely identified from its orbit.

Now consider what happens if we compute the corrected p-value P in Eq. 2.2 and let us once more examine the question of identifying the data X from its orbit. The p-value P compares the observed statistic T(X) against the set \(\{T(X_{\sigma \circ \sigma _{0}^{-1}})\}_{\sigma \in S}\), and so now the question is whether we can identify X from the set \(\{X_{\sigma \circ \sigma ^{-1}_{0}}\}_{\sigma \in S}\), which is the orbit of \(X_{\sigma _{0}^{-1}}\) for a randomly drawn σ0S. Identifying X is no longer possible because of the random σ0. For instance, working again with Example 2.1, suppose this set \(\{X_{\sigma \circ \sigma ^{-1}_{0}}\}_{\sigma \in S}\) is equal to {(0.8,0.5,0.2,1.0),(1.0,0.2,0.5,0.8),(0.5,0.8,1.0,0.2)}, in no particular order. We can identify that this is the orbit of x = (1.0,0.2,0.5,0.8) under S—that is, this set is equal to {xσ}σS. Then the following three possibilities are equally likely:

  • σ0 = Id and so \(X = x_{\text {Id}^{-1}} = x = (1.0, 0.2, 0.5, 0.8)\);

  • \(\sigma _{0} = \sigma _{1\leftrightarrow 3, 2\leftrightarrow 4}\) and so \(X = x_{\sigma _{1\leftrightarrow 3, 2\leftrightarrow 4}^{-1}} = (0.5,0.8,1.0,0.2)\);

  • \(\sigma _{0} = \sigma _{1\leftrightarrow 4, 2\leftrightarrow 3}\) and so \(X = x_{\sigma _{1\leftrightarrow 4, 2\leftrightarrow 3}^{-1}} = (0.8,0.5,0.2,1.0)\).

In other words, X is now equally likely to be any of the three values in our set, and validity is restored.

2.2 Random Samples from an Arbitrary Distribution

Our second generalization concerns permutations that are randomly chosen from an arbitrary distribution.

Theorem 2.2.

Let q be any distribution over \(\sigma \in \mathcal {S}_{n}\). Let \(\sigma _{0},\sigma _{1},\dots ,\) \(\sigma _{M}\overset {iid}{\sim } q\), and define

$$ P = \frac{1}{M+1}\sum\limits_{m=0}^{M} \mathbbm{1}\left\{T(X_{\sigma_{m}\circ\sigma_{0}^{-1}})\geq T(X)\right\} = \frac{1+{\sum}_{m=1}^{M} \mathbbm{1}\left\{T(X_{\sigma_{m}\circ\sigma_{0}^{-1}})\geq T(X)\right\}}{M+1}. $$
(2.6)

Then P is a valid p-value, i.e., \(\mathbb {P}_{H_{0}}\left \{P\leq \alpha \right \}\leq \alpha \) for all α ∈ [0,1].

This result is closely related to Besag and Clifford (1989)’s well known construction for obtaining exchangeable samples from Markov chain Monte Carlo (MCMC) sampling—the details are deferred to Section 4.2.
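
As an illustration of Theorem 2.2, the following sketch (ours; the `local_shuffle` sampler is a hypothetical, deliberately nonuniform choice of q) draws \(\sigma_{0},\sigma_{1},\dots,\sigma_{M}\) i.i.d. from a user-supplied sampler and computes the p-value of Eq. 2.6.

```python
import numpy as np


def sampled_generalized_pvalue(X, T, sample_perm, M=999, rng=None):
    """Eq. 2.6 (sketch): sigma_0, ..., sigma_M i.i.d. from q (via `sample_perm`), and
    P = (1 + #{m >= 1 : T(X_{sigma_m o sigma_0^{-1}}) >= T(X)}) / (M + 1)."""
    rng = np.random.default_rng() if rng is None else rng
    X = np.asarray(X)
    sigma0 = sample_perm(rng)
    inv0 = np.argsort(sigma0)                  # sigma_0^{-1} as an index array
    t_obs = T(X)
    count = sum(T(X[sample_perm(rng)[inv0]]) >= t_obs for _ in range(M))
    return (1 + count) / (M + 1)


def local_shuffle(rng, n=30, swaps=5):
    """A hypothetical nonuniform q: compose a few random adjacent transpositions."""
    sigma = np.arange(n)
    for _ in range(swaps):
        i = rng.integers(n - 1)
        sigma[[i, i + 1]] = sigma[[i + 1, i]]
    return sigma


X = np.random.default_rng(4).normal(size=30)
print(sampled_generalized_pvalue(X, lambda x: x[:15].mean(), local_shuffle, M=499))
```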

Just as before, some special cases of this result are well known to statisticians.

Random permutations from \(\mathcal {S}_{n}\)

In the simple case where q is the uniform distribution over \(\mathcal {S}_{n}\), Theorem 2.2 states that

$$ P = \frac{1+{\sum}_{m=1}^{M} \mathbbm{1}\left\{T(X_{\sigma_{m}\circ\sigma_{0}^{-1}})\geq T(X)\right\}}{M+1} \overset{d}{=} \frac{1+{\sum}_{m=1}^{M} \mathbbm{1}\left\{T(X_{\sigma_{m}})\geq T(X)\right\}}{M+1} $$
(2.7)

is a valid p-value. The equality in distribution above holds because the \({\sigma _{m}\circ \sigma _{0}^{-1}}\)’s are i.i.d. draws from \(\mathcal {S}_{n}\). Hence, this recovers the most commonly implemented form of the permutation test.

Random permutations from a subgroup

The distributional equality Eq. 2.7 extends to any uniform distribution q over a subgroup G of \(\mathcal {S}_{n}\) since in this case, as before, the random variables \({\sigma _{m}\circ \sigma _{0}^{-1}}\) are i.i.d. draws from G. This gives the following well-known result (see, e.g., Hemerik and Goeman (2018, Theorem 2)):

Corollary 2.1.

Let \(G\subseteq \mathcal {S}_{n}\) be a subgroup, and sample \(\sigma _{1},\dots ,\sigma _{M}\overset {iid}{\sim }\text {Unif}(G)\). Then

$$ P = \frac{1+{\sum}_{m=1}^{M} \mathbbm{1}\left\{T(X_{\sigma_{m}})\geq T(X)\right\}}{M+1} $$
(2.8)

is a valid p-value, i.e., \(\mathbb {P}_{H_{0}}\left \{P\leq \alpha \right \}\leq \alpha \) for all α ∈ [0,1].

Random permutations from a subset

Consider now the case where q is a uniform distribution over an arbitrary subset S. When S is not a subgroup, the p-value

$$ P = \frac{1+{\sum}_{m=1}^{M} \mathbbm{1}\left\{T(X_{\sigma_{m}\circ\sigma_{0}^{-1}})\geq T(X)\right\}}{M+1} $$

may not have the same distribution as

$$ \frac{1+{\sum}_{m=1}^{M} \mathbbm{1}\left\{T(X_{\sigma_{m}})\geq T(X)\right\}}{M+1}. $$

Here, Theorem 2.2 gives:

Corollary 2.2.

Let \(S\subseteq \mathcal {S}_{n}\) be any fixed subset of permutations. Sample \(\sigma _{0},\sigma _{1},\dots ,\sigma _{M}\overset {iid}{\sim }\text {Unif}(S)\). Then

$$ P = \frac{1+{\sum}_{m=1}^{M} \mathbbm{1}\left\{T(X_{\sigma_{m}\circ\sigma_{0}^{-1}})\geq T(X)\right\}}{M+1} $$
(2.9)

is a valid p-value, i.e., \(\mathbb {P}_{H_{0}}\left \{P\leq \alpha \right \}\leq \alpha \) for all α ∈ [0,1].

To the best of our knowledge, this statement has not previously been recorded in the literature. As a variant, the same result holds if we instead draw the permutations without replacement.

Corollary 2.3.

Consider the variant of Corollary 2.2 in which the permutations are drawn without replacement. Then the p-value P defined in Eq. 2.9 is a valid p-value, i.e., \(\mathbb {P}_{H_{0}}\left \{P\leq \alpha \right \}\leq \alpha \) for all α ∈ [0,1]. In the special case where S is a subgroup, the same conclusion also applies for the p-value P defined in Eq. 2.8.

Proof 1.

Let \(S^{\prime }\subseteq S\) be a subset of size M + 1 chosen uniformly at random. Let \(\sigma _{0},\sigma _{1},\dots ,\sigma _{M}\) be a random ordering of \(S^{\prime }\)—in particular, this means that σ0 is drawn uniformly from \(S^{\prime }\). Then by Theorem 2.1, applied with q taken to be the uniform distribution over \(S^{\prime }\), P is a valid p-value.

The second claim follows from the fact that

$$(\sigma_{1}\circ\sigma_{0}^{-1},\dots,\sigma_{M}\circ\sigma_{0}^{-1})\overset{d}{=} (\sigma_{1},\dots,\sigma_{M}),$$

whenever S is a subgroup. □
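
A short sketch (ours) of the without-replacement variant in Corollary 2.3, for a fixed finite set S of permutations.

```python
import numpy as np


def without_replacement_pvalue(X, T, S, M, rng=None):
    """Corollary 2.3 (sketch): draw sigma_0, ..., sigma_M uniformly without replacement
    from the fixed set S (requires M + 1 <= |S|), then compute Eq. 2.9."""
    rng = np.random.default_rng() if rng is None else rng
    X = np.asarray(X)
    idx = rng.choice(len(S), size=M + 1, replace=False)
    sigmas = [np.asarray(S[i]) for i in idx]
    inv0 = np.argsort(sigmas[0])                       # sigma_0^{-1}
    t_obs = T(X)
    count = sum(T(X[s[inv0]]) >= t_obs for s in sigmas[1:])
    return (1 + count) / (M + 1)


# Toy usage with S consisting of four cyclic shifts of n = 6 points.
S = [np.roll(np.arange(6), k) for k in range(4)]
X = np.random.default_rng(5).normal(size=6)
print(without_replacement_pvalue(X, lambda x: x[0], S, M=3))
```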

To guide the reader, Fig. 1 summarizes the connections between all the results presented thus far in the paper. Interestingly, as highlighted in the figure, Theorems 2.1 and 2.2 can be derived from each other; we will elaborate on this connection below.

Figure 1

A flowchart to illustrate the connections between (most of) the results presented in this paper. Arrows point from more general results to their special cases

Finally, we present another simple example to highlight the necessity of the σ0 term, in the case of nonuniform sampling. Indeed, even “intuitive” modifications of the uniform sampling scheme may fail to produce valid p-values.

Example 2.2.

If one considers the p-value in Eq. 1.3 to be a Monte Carlo estimate of the p-value computed in Eq. 1.2, then a lower-variance estimate may be obtained by “antithetic sampling”—that is, pairing a random draw \(\sigma _{m}\in \mathcal {S}_{n}\) with its reverse \(\text {Rev}(\sigma _{m}) = (\sigma _{m}(n),\dots ,\sigma _{m}(1))\) (see, e.g., Mitchell et al. (2022) for an example of this variance reduction technique). However, using antithetic sampling can lead to an invalid p-value—specifically, if \(\sigma _{1},\dots ,\sigma _{M/2}\) are drawn uniformly at random from \(\mathcal {S}_{n}\) (or from some subgroup \(G\subseteq \mathcal {S}_{n}\)), and we then set σM/2+i = Rev(σi) for each \(i=1,\dots ,M/2\), then the quantity \(P^{\prime }\), defined as in Eq. 1.3 but with these antithetically sampled permutations, may not be a valid p-value. For instance, suppose we take M = 2, so that σ2 = Rev(σ1) where σ1 is drawn uniformly from \(\mathcal {S}_{n}\). Take \(T(X) = X_{1} + X_{n}\), and draw \(X_{i}\overset {iid}{\sim } \mathcal {N}(0,1)\). Then

$$ P^{\prime} = \frac{1 + \mathbbm{1}\left\{X_{\sigma_{1}(1)}+X_{\sigma_{1}(n)}\geq X_{1}+X_{n}\right\} + \mathbbm{1}\left\{X_{\sigma_{1}(n)}+X_{\sigma_{1}(1)}\geq X_{1}+X_{n}\right\}}{3}. $$

We can then verify that, conditional on σ1, if {σ1(1),σ1(n)} = {1,n} then \(P^{\prime }=1\), while if {σ1(1),σ1(n)}≠{1,n} then \(P^{\prime }=\frac {1}{3}\) or \(P^{\prime }=1\) each with probability \(\frac {1}{2}\), which yields

$$\mathbb{P}\left\{P^{\prime} = \frac{1}{3}\right\} = \frac{1}{2} - \frac{1}{n(n-1)} > \frac{1}{3},$$

with the last step holding if n > 3. We can thus see that \(P^{\prime }\) is anti-conservative at the threshold \(\alpha = \frac {1}{3}\).

2.3 Proof of Theorem 2.1

First, for any fixed \(\sigma ^{\prime }\in \mathcal {S}_{n}\), we have

$$ \mathbb{P}_{H_{0}}\left\{\sum\limits_{\sigma\in\mathcal{S}_{n}} q(\sigma)\, \mathbbm{1}\left\{T(X_{\sigma\circ(\sigma^{\prime})^{-1}})\geq T(X)\right\}\leq \alpha\right\} = \mathbb{P}_{H_{0}}\left\{\sum\limits_{\sigma\in\mathcal{S}_{n}} q(\sigma)\, \mathbbm{1}\left\{T(X_{\sigma})\geq T(X_{\sigma^{\prime}})\right\}\leq \alpha\right\}, $$
(2.10)

because \(X \overset {d}{=} X_{\sigma ^{\prime }}\) under H0 (and note that \((X_{\sigma ^{\prime }})_{\sigma \circ \sigma '{}^{-1}} = X_{\sigma }\)). Next, we will apply a deterministic inequality by Harrison (2012): for all \(t_{1},\dots ,t_{N} \in [-\infty ,\infty ]\) and all \(\alpha ,w_{1},\dots ,w_{N} \in [0,\infty ]\),

$$ \sum\limits_{i=1}^{N} w_{i}\, \mathbbm{1}\left\{\sum\limits_{j=1}^{N} w_{j}\, \mathbbm{1}\left\{t_{j}\geq t_{i}\right\}\leq \alpha\right\} \leq \alpha. $$

Applying this bound with the \(q(\sigma)\)’s in place of the \(w_{i}\)’s, and the \(T(X_{\sigma})\)’s in place of the \(t_{i}\)’s, we obtain

$$ \sum\limits_{\sigma^{\prime}\in\mathcal{S}_{n}} q(\sigma^{\prime})\, \mathbbm{1}\left\{\sum\limits_{\sigma\in\mathcal{S}_{n}} q(\sigma)\, \mathbbm{1}\left\{T(X_{\sigma})\geq T(X_{\sigma^{\prime}})\right\}\leq \alpha\right\} \leq \alpha. $$
(2.11)

Finally, we have

$$ \begin{aligned} \mathbb{P}_{H_{0}}\left\{P\leq \alpha\right\} &= \mathbb{P}_{H_{0}}\left\{\sum\limits_{\sigma\in\mathcal{S}_{n}} q(\sigma)\, \mathbbm{1}\left\{T(X_{\sigma\circ\sigma_{0}^{-1}})\geq T(X)\right\}\leq \alpha\right\}\\ &= \sum\limits_{\sigma^{\prime}\in\mathcal{S}_{n}} q(\sigma^{\prime})\, \mathbb{P}_{H_{0}}\left\{\sum\limits_{\sigma\in\mathcal{S}_{n}} q(\sigma)\, \mathbbm{1}\left\{T(X_{\sigma\circ(\sigma^{\prime})^{-1}})\geq T(X)\right\}\leq \alpha\right\}\\ &= \sum\limits_{\sigma^{\prime}\in\mathcal{S}_{n}} q(\sigma^{\prime})\, \mathbb{P}_{H_{0}}\left\{\sum\limits_{\sigma\in\mathcal{S}_{n}} q(\sigma)\, \mathbbm{1}\left\{T(X_{\sigma})\geq T(X_{\sigma^{\prime}})\right\}\leq \alpha\right\}\\ &\leq \alpha, \end{aligned} $$

where the first step plugs in the definition of P in Eq. 2.1, the second step holds since \(\sigma _{0}\sim q\) is drawn independently of the data X, while the last two steps apply Eq. 2.10 and Eq. 2.11, respectively.

2.4 Connecting Theorems 2.1 and 2.2

As mentioned earlier, Theorems 2.1 and 2.2 can be derived from each other. We now give these proofs to show the connection.

Proof 2 (Alternative proof of Theorem 2.1 (via Theorem 2.2 )).

Let \(\sigma _{0},\sigma _{1},\sigma _{2},\dots \overset {iid}{\sim } q\), and for any fixed M, define

$$ P_{M} = \frac{1}{M+1}\sum\limits_{m=0}^{M} \mathbbm{1}\left\{T(X_{\sigma_{m}\circ\sigma_{0}^{-1}})\geq T(X)\right\}. $$

By the Law of Large Numbers, we see that \(\frac{1}{M+1}{\sum}_{m=0}^{M}\mathbbm{1}\{\sigma_{m} = \sigma\}\rightarrow q(\sigma)\) almost surely for all \(\sigma \in \mathcal {S}_{n}\), and therefore, \(P_{M}\rightarrow P\) almost surely, where P is the p-value defined in Eq. 2.1. In particular, this implies that \(P_{M}\) converges to P in distribution, and therefore

$$\mathbb{P}\left\{P\leq \alpha\right\} = \lim_{M\rightarrow\infty}\mathbb{P}\left\{P_{M}\leq \alpha\right\} \leq \alpha,$$

where the last step holds since, for every M ≥ 1, PM is a valid p-value by Theorem 2.2. □

Proof 3 (Proof of Theorem 2.2 (via Theorem 2.1 )).

Let \(\sigma _{0},\sigma _{1},\dots ,\sigma _{M}\overset {iid}{\sim } q\), and define the empirical distribution

$$\widehat{q} = \frac{1}{M+1}\sum\limits_{m=0}^{M} \delta_{\sigma_{m}},$$

where δσ is the point mass at σ. Now we treat \(\widehat {q}\) as fixed. Let k be drawn uniformly from \(\{0,\dots ,M\}\) (that is, \(\sigma_{k}\) is drawn at random from \(\widehat {q}\)). Applying Theorem 2.1 with \(\widehat {q}\) in place of q, we then see that

$$ P_{k} := \sum\limits_{\sigma\in\mathcal{S}_{n}} \widehat{q}(\sigma)\, \mathbbm{1}\left\{T(X_{\sigma\circ\sigma_{k}^{-1}})\geq T(X)\right\} = \frac{1}{M+1}\sum\limits_{m=0}^{M} \mathbbm{1}\left\{T(X_{\sigma_{m}\circ\sigma_{k}^{-1}})\geq T(X)\right\} $$

is a valid p-value conditional on \(\widehat {q}\), and therefore also valid after marginalizing over \(\widehat {q}\). Since \(\sigma _{0},\dots ,\sigma _{M}\) are drawn i.i.d. and are therefore in a random order, we see that

$$ P_{k} \overset{d}{=} \frac{1}{M+1}\sum\limits_{m=0}^{M} \mathbbm{1}\left\{T(X_{\sigma_{m}\circ\sigma_{0}^{-1}})\geq T(X)\right\}, $$

which is the desired p-value. □

2.5 Another Perspective: Exchangeable Permutations

Many of the results described above can be viewed through the lens of exchangeability—not on the data X (which we assume to be exchangeable under the null hypothesis H0), but on the collection of permutations used to define the p-value P.

Theorem 2.3.

Let \(\sigma _{0},\sigma _{1},\dots ,\sigma _{M}\in \mathcal {S}_{n}\) be a random set of permutations, which are exchangeable, i.e.,

$$ (\sigma_{0},\sigma_{1},\dots,\sigma_{M})\overset{d}{=} (\sigma_{\pi(0)},\sigma_{\pi(1)},\dots,\sigma_{\pi(M)}) $$

for any fixed permutation π on \(\{0,\dots ,M\}\). Then

$$ P = \frac{1}{M+1}\sum\limits_{m=0}^{M} \mathbbm{1}\left\{T(X_{\sigma_{m}\circ\sigma_{0}^{-1}})\geq T(X)\right\} $$

is a valid p-value, i.e., \(\mathbb {P}_{H_{0}}\left \{P\leq \alpha \right \}\leq \alpha \) for all α ∈ [0,1].

Many of the results stated earlier can be viewed as special cases—in particular, the results for a subgroup G, or for a subset S, as well as our more general result Theorem 2.2 for permutations drawn i.i.d. from q.

Proof 4.

To be clear, this theorem is essentially just a new perspective, and can be proved as a corollary to Theorem 2.1. To see why, let \(\sigma _{0},\dots ,\sigma _{M}\) be exchangeable, and let \(\widehat {q} = \frac {1}{M+1}{\sum }_{m=0}^{M}\delta _{\sigma _{m}}\) be the empirical distribution induced by the unordered set of drawn permutations. Then since \(\sigma _{0},\dots ,\sigma _{M}\) is exchangeable, conditional on \(\widehat {q}\) it holds that σ0 is a random draw from \(\widehat {q}\). Applying Theorem 2.1 with \(\widehat {q}\) in place of q gives the conclusion. □

However, we can also prove this result in a more intuitive way, using the framework of exchangeability:

Proof 5 (Alternative proof of Theorem 2.3 ).

Since the sequence \(\sigma _{0},\sigma _{1},\dots ,\sigma _{M}\) is exchangeable,

$$T(X_{\sigma_{0}}),T(X_{\sigma_{1}}),\dots,T(X_{\sigma_{M}})$$

is also exchangeable conditional on X. It is thus still exchangeable after marginalizing over X. Therefore, under the null hypothesis H0, the test statistic values

$$ T(X)=T(X_{\sigma_{0}\circ\sigma_{0}^{-1}}), \ T(X_{\sigma_{1}\circ\sigma_{0}^{-1}}),\ \dots, \ T(X_{\sigma_{M}\circ\sigma_{0}^{-1}}) $$
(2.12)

are also exchangeable—this follows immediately from the previous line because \(X \overset {d}{=} X_{\sigma _{0}^{-1}}\) under H0. This shows that the p-value P defined in Theorem 2.3 is valid. □

3 Averaging to Reduce Variance

The p-value P defined in Eq. 2.1 can equivalently be written as

$$P = \mathbb{P}_{\sigma \sim q}\left\{T(X_{\sigma\circ\sigma_{0}^{-1}})\geq T(X) | X,\sigma_{0}\right\}.$$

It is clear that P is random even if we condition on the observed data X, because of the randomness due to σ0. Consequently, in some settings P may be quite variable conditional on the data X, and this may be undesirable.

To address this issue, we can also consider averaging over σ0 (in addition to averaging over σ) in the calculation of P. This alternative definition is now a deterministic function of the observed data X, but may no longer be a valid p-value. Nonetheless, the following theorem shows a bound on the Type I error.

Theorem 3.1.

Let q be any distribution over \(\sigma \in \mathcal {S}_{n}\). Define

$$ \bar P = \sum\limits_{\sigma_{0}\in\mathcal{S}_{n}} \sum\limits_{\sigma\in\mathcal{S}_{n}} q(\sigma_{0})\, q(\sigma)\, \mathbbm{1}\left\{T(X_{\sigma\circ\sigma_{0}^{-1}})\geq T(X)\right\}, $$
(3.1)

or equivalently,

$$\bar P = \mathbb{P}_{\sigma,\sigma_{0} \overset{iid}{\sim} q}\left\{\left.T(X_{\sigma\circ\sigma_{0}^{-1}})\geq T(X)\right|X\right\}.$$

Then \(\bar P\) is a valid p-value up to a factor of 2, i.e., \(\mathbb {P}_{H_{0}}\left \{\bar P\leq \alpha \right \}\leq 2\alpha \) for all α ∈ [0,1]. In other words, the quantity \(\min \limits \{2\bar P,1\}\) is a valid p-value.

Proof 6.

Draw \(\sigma _{0}^{(1)},\sigma _{0}^{(2)},\dots \overset {iid}{\sim } q\). Let

$$ P_{m} = \sum\limits_{\sigma\in\mathcal{S}_{n}} q(\sigma)\, \mathbbm{1}\left\{T(X_{\sigma\circ(\sigma_{0}^{(m)})^{-1}})\geq T(X)\right\} $$

for each m ≥ 1. Then by Theorem 2.1, each \(P_{m}\) is a valid p-value. It is known (Rüschendorf, 1982; Vovk and Wang, 2020) that the average of valid p-values is valid up to a factor of 2, i.e., for any M ≥ 1 the average \(\bar {P}_{M} = \frac {1}{M}{\sum }_{m=1}^{M} P_{m}\) satisfies \(\mathbb {P}\left \{\bar {P}_{M}\leq \alpha \right \}\leq 2\alpha \) for all α ∈ [0,1]. We can equivalently write

$$ \bar{P}_{M} = \frac{1}{M}\sum\limits_{m=1}^{M}\sum\limits_{\sigma\in\mathcal{S}_{n}} q(\sigma)\, \mathbbm{1}\left\{T(X_{\sigma\circ(\sigma_{0}^{(m)})^{-1}})\geq T(X)\right\}. $$

By the Law of Large Numbers, \(\bar {P}_{M}\) converges almost surely to the p-value \(\bar P\) defined in Eq. 3.1, which completes the proof. □
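
When q has small support, the averaged quantity in Eq. 3.1 can be computed exactly; the sketch below (ours) returns \(\bar P\), from which \(\min\{2\bar P,1\}\) gives the valid p-value of Theorem 3.1.

```python
import numpy as np


def averaged_pvalue(X, T, perms, weights):
    """Eq. 3.1 (sketch): Pbar = sum over sigma_0, sigma of q(sigma_0) q(sigma) *
    1{T(X_{sigma o sigma_0^{-1}}) >= T(X)}; min(2 * Pbar, 1) is then valid by Theorem 3.1."""
    X = np.asarray(X)
    perms = [np.asarray(s) for s in perms]
    weights = np.asarray(weights, dtype=float)
    t_obs = T(X)
    pbar = 0.0
    for w0, s0 in zip(weights, perms):
        inv0 = np.argsort(s0)
        pbar += w0 * sum(w * (T(X[s[inv0]]) >= t_obs) for w, s in zip(weights, perms))
    return pbar


# With the set S and statistic of Example 2.1, Pbar comes out as 5/9 or 1 (see below).
S = [np.array([0, 1, 2, 3]), np.array([2, 3, 0, 1]), np.array([3, 2, 1, 0])]
X = np.random.default_rng(6).normal(size=4)
pbar = averaged_pvalue(X, lambda x: x[0] + x[1], S, [1 / 3, 1 / 3, 1 / 3])
print(pbar, min(2 * pbar, 1.0))
```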

Returning to Example 2.1, we see that while P was a mixture of PId, \(P_{\sigma _{1\leftrightarrow 4, 2\leftrightarrow 3}}, P_{\sigma _{1\leftrightarrow 3, 2\leftrightarrow 4}}\), we now have that \(\bar P\) is an average of these, meaning \(\bar P = \tfrac 13(P_{\text {Id}} + P_{\sigma _{1\leftrightarrow 4, 2\leftrightarrow 3}} + P_{\sigma _{1\leftrightarrow 3, 2\leftrightarrow 4}})\). Simplifying, we get

$$ \bar P = \begin{cases} \frac{5}{9}, & \text{ w.p. } 1/2, \\ 1, & \text{ w.p. } 1/2. \end{cases} $$

It is worth noting that this new quantity \(\bar P\) is neither more conservative nor more anti-conservative than the p-value P in Eq. 2.5 from earlier. This is perhaps a more general phenomenon: the average of valid p-values need not be anti-conservative, and indeed it can often be more conservative than the original p-values.

Analogously, the p-value in Theorem 2.2, computed via random samples from q, can also be averaged to reduce variance.

Theorem 3.2.

Let q be any distribution over \(\sigma \in \mathcal {S}_{n}\). Let \(\sigma _{0},\sigma _{1},\dots ,\) \(\sigma _{M}\overset {iid}{\sim } q\), and define

$$ \bar P = \frac{1}{(M+1)^{2}}\sum\limits_{k=0}^{M}\sum\limits_{m=0}^{M} \mathbbm{1}\left\{T(X_{\sigma_{m}\circ\sigma_{k}^{-1}})\geq T(X)\right\}. $$
(3.2)

Then \(\bar P\) is a valid p-value up to a factor of 2, i.e., \(\mathbb {P}_{H_{0}}\left \{\bar P\leq \alpha \right \}\leq 2\alpha \) for all α ∈ [0,1]. Thus, as before, the quantity \(\min \limits \{2\bar P,1\}\) is a valid p-value.

The proof is similar to that of Theorem 3.1, and we omit it for brevity.

4 Connections to the Literature

We next mention a few connections to the broader literature.

4.1 Permutation Tests vs. Randomization Tests

Hemerik and Goeman (2021) describe the difference between two testing frameworks, permutation tests (as studied in our present work) versus randomization tests. The difference is subtle, because randomization tests may still use permutations. Specifically, Hemerik and Goeman (2021) highlight

an important difference in mathematical reasoning between these classes: a permutation test fundamentally requires that the set of permutations has a group structure, in the algebraic sense; the reasoning behind a randomisation test is not based on such a group structure, and it is possible to use an experimental design that does not correspond to a group.

To better understand this distinction, we can consider a scenario where a fixed subset \(S\subseteq \mathcal {S}_{n}\), which is not a subgroup, is used for a randomization test rather than a permutation test. Consider a study comparing a treatment versus a placebo, with n/2 subjects assigned to each of the two groups. We can use a permutation σ to denote the treatment assignments, with σ(i) ≤ n/2 indicating that subject i receives the treatment, and σ(i) > n/2 indicating that subject i receives the placebo. Now we switch notation, to be able to compare to permutation tests more directly—writing \(X= (1,\dots ,1,0,\dots ,0)\), suppose that we will assign treatments via the permuted vector Xσ, i.e., for each subject \(i=1,\dots ,n\), under this permutation σ the ith subject will receive the treatment if Xσ(i) = 1, or the placebo if Xσ(i) = 0.

Now suppose that we draw a random treatment assignment \(\sigma _{\text {asgn}}\sim \text {Unif}(S)\), from a fixed subset \(S\subseteq \mathcal {S}_{n}\) (for example, S may be chosen to restrict to treatment assignments that are equally balanced across certain subpopulations). After the treatments are administered, the measured response variable is given by \(Y=(Y_{1},\dots ,Y_{n})\). Fix any test statistic T(X) = T(X,Y ) (we will implicitly condition on Y ), and compute

$$ P = \frac{1}{|S|}\sum\limits_{\sigma\in S} \mathbbm{1}\left\{T(X_{\sigma})\geq T(X_{\sigma_{\text{asgn}}})\right\}. $$
(4.1)

Since σasgn was drawn uniformly from S, this quantity P is a valid p-value. In the terminology of Hemerik and Goeman (2021), this test is a randomization test, not a permutation test. While the set of possible treatment assignments {Xσ : σS} happens to be indexed by permutations σ, the group structure of permutations is not used in any way, and we do not rely on any invariance properties.

Comparing to the invalid p-value considered in Eq. 2.3, we can easily see the distinction: for a randomization test, the observed statistic is \(T(X_{\sigma _{\text {asgn}}})\) for a randomly drawn \(\sigma _{\text {asgn}}\sim \text {Unif}(S)\), while in the permutation test in Eq. 2.3, the observed statistic is T(X) (i.e., using the fixed permutation Id in place of a randomly drawn σasgn). For this reason, the randomization test p-value in Eq. 4.1 is valid, while the permutation test calculation in Eq. 2.3 is not valid in general.
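
To illustrate the distinction in code, here is a sketch (ours; the balanced design, the cyclic-shift assignment set, and the difference-in-means statistic are all toy choices) of the randomization test p-value in Eq. 4.1, where the observed statistic is computed under the randomly drawn assignment \(\sigma_{\text{asgn}}\).

```python
import numpy as np


def randomization_pvalue(Y, T, S, sigma_asgn):
    """Eq. 4.1 (sketch): rank T(X_{sigma_asgn}, Y) among {T(X_sigma, Y) : sigma in S}."""
    n = len(Y)
    X = np.r_[np.ones(n // 2), np.zeros(n - n // 2)]     # base labels: 1 = treatment, 0 = placebo
    t_obs = T(X[np.asarray(sigma_asgn)], Y)
    return np.mean([T(X[np.asarray(s)], Y) >= t_obs for s in S])


# Toy usage: S is an arbitrary fixed set of allowed assignments (it need not be a subgroup);
# sigma_asgn is drawn uniformly from S before the responses Y are generated.
rng = np.random.default_rng(7)
n = 10
S = [np.roll(np.arange(n), k) for k in (0, 1, 3, 4, 7)]
sigma_asgn = S[rng.integers(len(S))]
Y = rng.normal(size=n)                                    # null responses: no treatment effect
T = lambda x, y: y[x == 1].mean() - y[x == 0].mean()      # difference in group means
print(randomization_pvalue(Y, T, S, sigma_asgn))
```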

Now we again consider Hemerik and Goeman (2018)’s method using a fixed subset. This test in Eq. 2.2 is a permutation test, not a randomization test—the observed data X, and its corresponding statistic T(X), do not arise from a random treatment assignment. More generally, our proposed test in Eq. 2.1 using an arbitrary distribution q on \(\mathcal {S}_{n}\) is again a permutation test rather than a randomization test—that is, the observed data is given by X itself, not by a randomly chosen treatment assignment \(X_{\sigma _{\text {asgn}}}\) for \(\sigma _{\text {asgn}}\sim q\). Nonetheless, we are able to produce a valid p-value without assuming an underlying group structure or uniform sampling for the permutations considered by the test.

4.2 Exchangeable MCMC

The result of Theorem 2.2, which allows for random samples drawn from an arbitrary distribution q on \(\mathcal {S}_{n}\), is closely connected to Besag and Clifford (1989)’s well known construction for obtaining exchangeable samples from Markov chain Monte Carlo (MCMC) sampling.

Consider a distribution Q0 on \(\mathcal {Z}\), and suppose we want to test

$$H_{0}: \ Z\sim Q_{0}$$

with some test statistic T(Z). To find a significance threshold for T(Z), we would ideally like to draw from the null distribution, i.e., compare T(Z) against \(T(Z_{1}),\dots ,T(Z_{M})\) for \(Z_{1},\dots ,Z_{M} \overset {iid}{\sim } Q_{0}\). However, in many settings, sampling directly from Q0 is impossible, but we instead have access to a Markov chain whose stationary distribution is Q0. If we run the Markov chain initialized at Z to obtain draws \(Z_{1},\dots ,Z_{M}\) (say, running the Markov chain for some fixed number of steps s between each draw), then dependence among these sequentially drawn samples means that \(Z,Z_{1},\dots ,Z_{M}\) are not i.i.d., and are not even exchangeable. Without studying the mixing properties of the Markov chain, we cannot determine how large the number of steps needs to be for the dependence to become negligible. Instead, Besag and Clifford (1989) propose a construction where the samples are drawn in parallel (rather than sequentially), which ensures exchangeability:

Theorem 4.1 (Besag and Clifford (1989, Section 2)).

Let Q0 be any distribution on a probability space \(\mathcal {Z}\). Construct a Markov chain on \(\mathcal {Z}\) with stationary distribution Q0, whose forward and backward transition distributions (initialized at \(z\in \mathcal {Z}\)) are denoted by \(Q_{\rightarrow }(\cdot |z)\) and \(Q_{\leftarrow }(\cdot |z)\). Let \(Q_{\rightarrow }^{s}(\cdot |z)\) and \(Q_{\leftarrow }^{s}(\cdot |z)\) denote the forward and backward transition distributions after running s steps of the Markov chain, for some fixed s ≥ 1. Given an initialization Z, suppose we generate data as in the left plot of Fig. 2:

$$\begin{cases}\text{First, draw } Z_{*} \sim Q^{s}_{\leftarrow}(\cdot|Z);\\ \text{Then, draw } Z_{1},\dots,Z_{M}\overset{iid}{\sim} Q^{s}_{\rightarrow}(\cdot|Z_{*}).\end{cases}$$

If it holds marginally that \(Z\sim Q_{0}\), then the draws \(Z,Z_{1},\dots ,Z_{M}\) are exchangeable.

Figure 2

Left: Besag and Clifford (1989)’s parallel construction (with s = 1). Right: the construction used in Theorem 2.2

Given this exchangeability property, the quantity \(P = \frac{1+{\sum }_{m=1}^{M} \mathbbm{1}\left\{T(Z_{m})\geq T(Z)\right\}}{M+1}\) is then a valid p-value for testing \(H_{0}: Z\sim Q_{0}\).
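
A generic sketch (ours) of this parallel construction is given below; `step_forward` and `step_backward` are user-supplied one-step transition samplers whose stationary law is the null distribution Q0, and the random-transposition kernel in the usage example is a hypothetical choice that is its own reversal.

```python
import numpy as np


def besag_clifford_pvalue(Z, T, step_forward, step_backward, M=99, s=1, rng=None):
    """Theorem 4.1 (sketch): take s backward steps from the observed Z to a hidden Z_*,
    run M independent forward chains of s steps each from Z_*, and rank T(Z) among
    T(Z_1), ..., T(Z_M)."""
    rng = np.random.default_rng() if rng is None else rng
    z_star = Z
    for _ in range(s):
        z_star = step_backward(z_star, rng)
    t_obs = T(Z)
    count = 0
    for _ in range(M):
        z = z_star
        for _ in range(s):
            z = step_forward(z, rng)
        count += T(z) >= t_obs
    return (1 + count) / (M + 1)


def random_transposition_step(z, rng):
    """One step of the permutation chain: swap two coordinates chosen at random."""
    z = np.array(z, copy=True)
    i, j = rng.choice(len(z), size=2, replace=False)
    z[[i, j]] = z[[j, i]]
    return z


Z = np.random.default_rng(8).normal(size=20)
print(besag_clifford_pvalue(Z, lambda z: z[:10].mean(), random_transposition_step,
                            random_transposition_step, M=199, s=3))
```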

Now we will see how Theorem 2.2 is related to this result. Let \(\mathcal {Z} = \mathcal {X}^{n}\), and let Q0 be any exchangeable distribution. In the setting of this paper, we do not know Q0 precisely, which makes it a bit different from a typical setting where Besag and Clifford (1989)’s method is applied. However, we will work with a Markov chain for which any exchangeable distribution Q0 is stationary, and in fact, Theorem 4.1 holds regardless of whether Q0 is the unique stationary distribution for the Markov chain.

Consider the Markov chain given by applying a randomly chosen permutation \(\sigma \sim q\), that is, for \(x = (x_{1},\dots ,x_{n})\),

$$ Q_{\rightarrow}(\cdot|x) = \sum\limits_{\sigma \in \mathcal{S}_{n}} q(\sigma) \cdot\delta_{x_{\sigma}}, $$

where \(\delta _{x_{\sigma }}\) is the point mass at xσ, while the backward transition probabilities are given by

$$ Q_{\leftarrow}(\cdot|x) = \sum\limits_{\sigma \in \mathcal{S}_{n}}q(\sigma) \cdot\delta_{x_{\sigma^{-1}}}. $$

Then, to implement the test described in Theorem 2.2, we run Besag and Clifford (1989)’s method (with s = 1): we define \(X_{*} = X_{\sigma _{0}^{-1}}\), and then define \(X_{m} = (X_{*})_{\sigma _{m}} = X_{\sigma _{m}\circ \sigma _{0}^{-1}}\) for \(m=1,\dots ,M\). This is illustrated on the right-hand side of Fig. 2. If X is exchangeable (that is, it is drawn from some exchangeable Q0), then the exchangeability of \(X,X_{1},\dots ,X_{M}\) follows by Theorem 4.1, and this verifies that P is a valid p-value, thus completing the proof of Theorem 2.2.

Of course, we have only written out our method for the s = 1 case (where s is the number of steps of the Markov chain). New variants of our method can be constructed by taking s > 1 backward steps to the hidden node, and the same number s of forward steps to the permuted data. All of these are valid for the same reason as the s = 1 case.

5 Conclusion

We proposed a new method for permutation testing that generalizes previous methods. This idea naturally opens up new lines of theoretical and practical enquiry. In this work, we have focused on validity, but it is of course also important to examine the consistency and power of such methods. In particular, Dobriban (2022) and Kim et al. (2022) study the power of the permutation test when using the full permutation group \(\mathcal {S}_{n}\); it would be interesting to examine this question in the context of using only a subset \(S\subseteq \mathcal {S}_{n}\) or a nonuniform distribution over \(\mathcal {S}_{n}\). In addition, the theoretical guarantees for all the permutation tests considered here ensure a p-value P that is valid in the sense of satisfying \(\mathbb {P}_{H_{0}}\left \{P\leq \alpha \right \}\leq \alpha \), which means that P could potentially be quite conservative under the null (for instance, we saw this behavior when ‘fixing’ the failure example in Section 2.1). It would also be interesting to understand which types of tests reduce overly conservative outcomes.

In conclusion, it is perhaps remarkable that one can still gain new understanding about classical permutation methods. In turn, this enhanced understanding can inform other areas of inference. As an example, the results from this paper were motivated by questions in conformal prediction (Vovk et al. 2005), a method for distribution-free predictive inference. Classically, conformal prediction has relied on exchangeability of data points (e.g., training and test data are drawn i.i.d. from the same unknown distribution), and thus the joint distribution of the data (including both training samples and a test point) is invariant under an arbitrary permutation. In contrast, in our recent work (Barber et al. 2023), we studied the problem of constructing prediction intervals when the data do not satisfy exchangeability; for instance, the distribution of observations may simply drift over time in an unknown fashion. Thus the data is no longer invariant under an arbitrary permutation, and so we instead restrict attention to a weighted distribution over simple permutations that only swap the test point with a random training point, which at least approximately preserve the distribution of the data. These swaps clearly do not form a subgroup of permutations, and are weighted non-uniformly; understanding how permutation tests operate in this setting, as in Theorem 2.1, is key to the findings in our aforementioned work.