High-dimensional randomization-based inference capitalizing on classical design and modern computing

A common complication that can arise with analyses of high-dimensional data is the repeated use of hypothesis tests. A second complication, especially with small samples, is the reliance on asymptotic p-values. Our proposed approach for addressing both complications uses a scientifically motivated scalar summary statistic, and although not entirely novel, seems rarely used. The method is illustrated using a crossover study of seventeen participants examining the effect of exposure to ozone versus clean air on the DNA methylome, where the multivariate outcome involved 484,531 genomic locations. Our proposed test yields a single null randomization distribution, and thus a single Fisher-exact p-value that is statistically valid whatever the structure of the data. However, the relevance and power of the resultant test requires the careful a priori selection of a single test statistic. The common practice using asymptotic p-values or meaningless thresholds for “significance” is inapposite in general.


Introduction
Published data analyses from some fields rich with data, such as genomics and epigenomics, often fail to report valid and informative p-values when faced with many tests (e.g., Zeilinger et al. 2013;Morozova et al. 2015). Instead, some reported alternatives aim to "adjust" for a plethora of calculated p-values, using crudely applied Bonferroni adjustments (1936), which are typically grossly conservative (Perneger 1998).
In contrast, we describe a procedure for generating a single null randomization distribution for a large set of sharp null hypotheses, and thus a single Fisher-exact p-value. To be helpful, the resulting p-value depends critically on a priori choosing a revealing scalar test statistic, that is sensitive to scientifically interesting departures from the sharp null hypothesis, which is typically that there is absolutely no effect of treatments being studied on any of the outcomes. Making this choice of test statistic generally makes demands on the researcher to think carefully before attacking the data, a task which is explored in the context of our example. This approach also relies on modern computing environments to implement, but the demands on the researcher to think before computing is more critical. This is especially true in the current environment of ever-increasing computational power, which allows many superficial exploratory analyses. Such analyses, although apposite to conduct, should be postponed until after confirmatory analyses have been completed. We illustrate the essential idea in the context of a randomized crossover experiment, where we compare two treatment conditions on a multi-component epigenetic outcome measured at almost half a million epigenomic locations.

Epidemiological context
Epidemiological studies in humans have reported associations of ozone exposure with mortality (Bell et al. 2004;Jerrett et al. 2009;Turner et al. 2016), as well as with epigenetic marks (e.g., Bind et al. 2014), but better support for causal relationships comes from studies with controlled exposures (Devlin et al. 2012). Suggestive results have been seen in a randomized controlled study with rats (Miller et al. 2018). However, there is the desire for studies exploring evidence for causality between ozone exposure and the DNA methylome in humans using randomized experiments.

The randomized crossover experiment
The epigenetic study used here has been previously described ). Briefly, a total of seventeen healthy young adults (indexed by i = 1, … , N ) were exposed on two occasions, separated by at least thirteen days, once to clean air and once to ozone, randomly ordered (see Table 1), for two hours while exercising intermittently. We denote the ith participant's treatment assignments at periods j = 1 and j = 2 by the two-component row vector After each exposure, each participant underwent bronchoscopy during which epithelial cells were removed and used to measure DNA methylation on each of 484,531 CpG sites along the genome with the Illumina 450K platform. 1 For each participant, the scalar outcome Y = "DNA methylation on CpG sites" is defined as the proportion of methylated cytosines over all methylated and unmethylated cytosines.
In this experiment, the exposure factor thus has two levels, clean air and ozone, and each participant is randomized at period j = 1 to one or the other treatment level, and the exposure level at period j = 2 differs from the exposure level at period j = 1 for all units. A randomized crossover experiment is a fractional design of a more general randomized experiment, where both exposure factors have two levels, clean air and ozone, are randomized at period j = 1 and at period j = 2.
The objective is to learn whether ozone exposure has an effect on the DNA methylome. First step is to understand DNA methylation under clean air, and then to anticipate deviations under ozone exposure.

The DNA methylome under clean air: a set of dependent outcomes
Common analyses of the DNA methylome typically assume its myriad components are independent.
DNA methylation appears to act as a "switch" that controls gene expression because its distribution across regions is bimodal with most regions either highly methylated or essentially unmethylated (Bennett and Hasty 2007). This feature is also seen in our data: Fig. 1 presents the empirical density of the sample means of methylation under clean air (whether at first or second period) averaged over the seventeen participants at each of the 484,531 CpG sites (which are indexed by k ∈ {1, … , 484,531} and ordered by chromosome and genomic location). This figure reveals a bimodal distribution of means over seventeen participants across CpG sites, which is also seen at the participant-level.
(W i,j=1 = 0, W i,j=2 = 1) Exposed to clean air Exposed to ozone 10 (W i,j=1 = 1, W i,j=2 = 0) Exposed to ozone Exposed to clean air With such a high-dimensional outcome, the typical hope is to capitalize on some independence or at least some exchangeability somewhere. However, highdimensional epigenomic outcomes are relatively complicated and consist of possibly correlated DNA methylation measurements at the 484,531 CpG sites. To explore this possible correlational structure, Fig. 2 presents the distributions of the upper triangle of the matrix of all pairwise correlations between: (1) measurements of the first 1000 "adjacent" CpG sites on chromosome 1 measured by the 450K array under clean air (left panel), and (2) 1000 randomly drawn CpG sites under clean air (right panel). The number of elements in the upper triangle of the correlation matrices is thus 1000 * 999 2 = 499,500. For comparison with simple randomness, Fig. 3 presents the simulated distribution under the null hypothesis of no correlation between 1000 normal random variables: We drew 499,500 independent variables such that their squared values follow a Beta( 1 2 , N−2 2 ) distribution, which is the null distribution of the estimated squared-correlation coefficient between two independent normally distributed random variables.
Based on Figs. 2 and 3, it is difficult to assess visually whether levels of DNA methylation on CpG sites are uncorrelated in our data set. It appears that the right tail of the empirical correlation distributions (Fig. 2) may have longer tails than the distribution assuming normality and independence, which is displayed in Fig. 3. Figure 4 presents Q-Q plots comparing the correlations under normality and independence, i.e., ± √ Beta( 1 2 , 17−2 2 ) to the empirical correlations between: (1) the first 1000 adjacent CpG sites on chromosome 1 under clean air, and (2) the 1000 randomly drawn CpG sites under clean air, respectively. We enlarge the plots at both extremes to focus on the tails. The extreme tails do not follow a 45 • line, suggesting that, even under clean air, there are more extreme correlations than expected under normality and independence, suggesting that the joint assumption, normality and independence, is suspect for those data, and thus pursuing analyses based on the usual normal and independence assumptions could be deceptive.

Typical Bonferroni correction applied to the Fisher exact p-value
In biomedical research, it is common to have studies like this one, with a small number of participants compared to the number of outcomes; such data sets are sometimes referred as "small N, big P data". In such settings, it is common to control the family-wise error rate (FWER) of statistical tests, the probability of making one or more false "discoveries" (also called false positives or type I errors). Let be the statistical significance level for each of the K statistical tests we perform of the sharp null hypothesis. The Bonferroni "correction" (Bonferroni 1936) states that we should declare as "significant", or more accurately possibly "worth further scrutiny" (Edgeworth 1885;Fisher 1925;Wasserstein et al. 2019), only those p-values that are less than or equal to K , and that doing so will control the FWER to be less or equal to .
For example, in a randomized crossover experiment with N participants and two treatment levels (i.e., either W i = (0, 1) or W i = (1, 0) with our notation), the smallest Fisher-exact p-value that can be obtained is equal to the inverse of the total number of possible randomizations, that is, for a 50-50 Bernoulli assignment mechanism with N units, 1 2 N . In our example with N = 17 individuals, even assuming a Bernoulli assignment mechanism (which maximizes the number of possible randomizations and therefore minimizes the possible Fisher-exact p-value), the minimal Fisher-exact p-value that can be obtained from one test is equal to 1 2 17 = 1 131,072 ≈ 7.63 × 10 −6 . The associated Bonferroni "correction factor" in this example with K = 484,531 tests is equal to 0.05 484,531 ≈ 1.03 × 10 −7 , which reveals that, a priori, the Bonferroni adjustment offers zero power in this experiment because all such "adjusted" p-values exceed 1! Therefore, this commonly-used adjustment method has no practical value in situations like our example. In a similar crossover study with N = 10 individuals (Zhong et al. 2017), the minimal attainable Fisher-exact p-value that can be obtained (assuming a Bernoulli assignment mechanism), would be equal to 1 2 10 = 1 1,024 ≈ 10 −3 . Because all "Bonferroni-adjusted" p-values exceeded 1, the authors lowered the significance threshold to an arbitrarily value of 10 −4 , which was inapposite, not only because of the FWER, but also because the chosen threshold was below the minimum attainable Fisher-exact p-value. There must be a better way for us to attack this problem!

Notation
The following sections may appear overly tedious with respect to their notations. However, we believe that the statistical issues described here have been imprecisely defined, especially in crossover studies.

Potential outcomes and assumptions
As in Table 1, let W i = (W i,j=1 , W i,j=2 ) denote the assignment of exposures for participant i, i ∈ {1, … , N = 17} . After exposure to a treatment at each period j, for each participant i, a multivariate outcome measurement with dimension K, generically denoted , is taken. We use the notation of potential outcomes to define the possible observed values in this experiment (Neyman 1990;Rubin 1974Rubin , 1990. Under the stable unit treatment value assumption (SUTVA) (Rubin 1980), the values of the four K-dimensional potential outcome measurements that could be observed for par- Table 2.

Causal estimands
Possible causal estimands of interest include the unit-level causal effect of assignment to ozone versus assignment to clean air at period j on the K-dimensional outcome: and The K-dimensional participant-level, period averaged, causal effect of the exposure to ozone vs. exposure to clean air on the high-dimensional outcome is i = i,j=1 + i,j=2 2 , and the grand average causal effect across participants is = 1 17 ∑ 17 i=1 i . Of course, none of these causal estimands can be directly observed (Rubin 1978;Holland 1986).
Here, for i,j=2 to be an easily-interpretable estimand, we assume that the time between periods j = 1 and j = 2 is adequate to "wash out" the possible effect of the ozone exposure applied at period j = 1 on the outcome measured at period j = 2 , i.e., . This assumption has been commonly referred in the experimental design literature as the "no carryover effect" assumption. Note that, for easy interpretation, we implicitly assume no "placebo effect" for each unit, which is plausible with our data, because the participants were blinded to their exposures.
In a randomized crossover experiment with the assumptions of no carry-over and no time effects, other participant-level contrasts of potential outcomes could be of scientific interest, such as (W i,j=1 = 0) for participants exposed first to clean air, and OZ,CA 0) for participants exposed first to ozone. We will refer to these seventeen participant-level estimands of ozone exposure "minus" clean air as � i = CA,OZ i + OZ,CA i 2 , and their average across participants as ′ .

Observed outcomes
Each K-dimensional unit-level observed outcome measured for participant i at period j, y obs i,j , is a function of the treatment assignment W i,j and the potential outcomes at period j: and Similarly, we define the participant-level observed outcome under clean air as: The observed methylation mean under clean air at site k and the estimated standard deviation under clean air at site k are ̂k CA , respectively. The observed participant-level outcome at site k, y obs,k i , can be expressed as (y obs,k i,j=1 , y obs,k i,j=2 ) or as (y obs,k CA,i , y obs,k OZ,i ) in obvious, but not identical notations.

Naive sharp null hypothesis
The Fisher sharp null hypothesis ( H 00 ) in this setting asserts that, for each participant i, there is no effect of exposure assigned at any period on any of the vector outcomes Y i,j measured at any period j, that is 2 : Note that in the more general randomized experiment where the assignment is randomized at both periods j = 1 and j = 2 , the corresponding Fisher's sharp null hypothesis would include two additional statements: and In the context of our randomized crossover experiment, these last two statements can be omitted, because they have no empirical consequences: neither

Tested sharp null hypothesis
The sharp null hypothesis that we are actually testing in this randomized crossover experiment contains two additional statements, i.e., there are no time and no carryover effects. If we obtain evidence against the sharp null hypothesis, we would have to consider both additional assumptions, because such evidence is formally evidence against any aspect of the null hypothesis.

Construction of a summarizing test statistic
To make each component of the K-component outcome comparable, we scale the k th component of the unit-level outcome measured at period j, y obs,k i,j , using the estimated methylation mean under clean air, ̂C A , and the estimated standard deviation under clean air ̂C A , both described previously: Notice that y obs,k scaled,i,j is a function of the clean air measurements for all units, a function of the observed randomization, and therefore, not a purely scientific estimand.
Assuming no carry-over and no time effects, the k th component of ′ can be estimated by regressing each vector of 34 unit-level observed outcomes on each participant's treatment level at period j, thereby leaving us with K = 484,531 estimated regression coefficients. Here, because the dimension of the outcome Y is large compared to the number of units, to conduct a randomization-based test, we need a scalar test statistic. We consider what is often called "regularization" approaches.
We choose a test statistic that reflects a fairly-popular approach used in biomedical research (e.g., Suchting et al. 2019), defined by the following two-step procedure: Step 1: Let us consider a "backward and unnatural" logistic regression using 34 units (i.e., two time points per participant), which is sometimes referred as "reverse regression" (Conway and Roberts 1983): Our use of this statistic is not advocating the use of the "reverse regression" approach for estimating causal effects on an outcome, but rather simply illustrating how to construct a possibly useful scalar test statistic using a popular way of examining treatment-outcome associations in high-dimensional settings. Notice that here, this "backward" logistic regression is only used for defining our test statistic. Other approaches for choosing test statistics, for instance, based on subject-matter knowledge, are likely to be at least as scientifically revealing. The main motivation for this choice is to shrink the estimated regression coefficients ̂k using a Lasso-type of regularization technique.
Specifically, we consider the generalized Elastic-Net regularization (Friedman et al. 2010), which recently has become a popular penalization strategy (e.g., Suchting et al. 2019). This regularization combines Lasso-type (i.e., by adding a first penalty term 1 484,531 ∑ k=1 � k � ), and Ridge-type (i.e., by adding a second penalty term 2 484,531 penalizations, where 1 and 2 are two tuning parameters. As is sometimes recommended, we choose the tuning parameters that minimize the Bayesian Information Criterion (BIC) using a two-dimensional grid search. These two penalties aim to shrink the "irrelevant" regression coefficients towards zero (Fan and Peng 2004). Note that some epigenomic studies have stopped at this step (e.g., Yoon et al. 2017), i.e., after the regularization method is performed, with authors reporting the estimated "true" (i.e., non-zero) ̂k 's. Again, here, we are not advocating the Elastic-Net algorithm. The main contribution of our approach is delineating the steps needed to provide a valid and possibly-informative Fisher-exact p-value, which does not rely on asymptotic arguments. Future work should perform a simulation study varying, for example, the distribution of the high-dimensional potential outcomes, sample size, test statistic, and include power considerations.
Step 2: With the selected ̂k 's from Step 1, we construct the following scalar test statistic, which should be sensitive to the number of selected CpG sites and their effect sizes. There are many ways of constructing a test statistic in such a big data setting. To obtain a small-sample valid Fisher-exact p-value, any strategy needs to summarize high-dimensional empirical evidence by a scalar.

Randomization-based causal inference
The actual assignment mechanism could not be precisely recovered . Consequently, we assume a Bernoulli mechanism with probability 0.5, which leads to 2 17 possible randomized allocations with our data. Using the Harvard FAS cluster, we repeat the two-step procedure just described 2 17 times assuming the sharp null hypothesis H 00 , and thereby construct the null randomization distribution of T. The Fisher-exact p-value is the proportion of values of T calculated under H 00 that are as extreme or more extreme than the observed value of T. Note that the statistic T varies across randomizations, because of the selected non-zero ̂k 's, their number, their values, and their corresponding CpG sites.

Results
The Elastic-Net procedure selected thirteen CpG sites (see Table 3) with the actual data. Among these thirteen CpG sites, eight were related to the following functional genes: Recall that such a listing is the typical conclusion in current publications (e.g., Suchting et al. 2019). The implicit conclusion appears to be that these sites could be further examined in future investigations.
The distribution of the 2 17 values of the test statistic T under the sharp null hypothesis is shown in Fig. 5, where the associated Fisher-exact p-value is 16,203 2 17 ≈ 0.12 ; 16, 203 is the number of values of T that are greater or equal to the observed value of T.

What if Fisher had access to high-performance computing?
This general procedure for a randomized experiment, we believe first described explicitly in Fisher (1925), is compatible with any test statistic and generates an associated valid p-value. To our knowledge, our initial version of this article is the first time that regularization methods are coupled with randomization-based inference to obtain a valid Fisher-exact p-value that does not rely on any asymptotic arguments. We speculate that randomization-based inference was not emphasized in Fisher's earlier discussions because of the lack of high-performance computing during most of the 20th century. However, we now have computing capabilities to display the null randomization distributions, and we should do so to avoid relying on asymptotic approximations, which can be misleading .

Display the null randomization distribution
Displaying a null randomization distribution may provide valuable scientific insights, as already argued by us . Here, we displayed a possiblyinsightful null randomization distribution. First, this null randomization distribution has two major components: A point mass at zero corresponding to 110,403 allocations (i.e., ≈ 84% ), indicating that under the null hypothesis, most values of T would be zero, and the remaining randomized allocations that led to positive values of the test statistic with a maximum around 0.6. The location of the observed test statistic, T, is away from the point mass at zero but not near the most extreme null value of the test statistic. Again, future work should perform a simulation study varying, e.g., the high-dimensional potential outcomes, sample size, test statistic, and comparing characteristics of the resulting null randomization distributions.

Selection methods
Here, we provide a non-parametric, non-asymptotic, approach to obtain a valid Fisher-exact p-value; in our context, this p-value was around 0.12, which suggests only quite modest evidence about any causal effects of ozone (vs. clean air) on DNA methylome. Although we chose to use the Elastic-Net procedure to construct our test statistic, other algorithms could have been used instead, such as the ones proposed by Li et al. (2015) (which selects non-zero ̂k through "a penalized multivariate multiple linear regression model with an arbitrary group structure for the regression coefficient matrix") or the one proposed by Cox and Battey (2017) (which capitalizes on partially balanced incomplete block designs to select several choices of sparse non-zero ̂k that fit as well as the Lasso's choice). Cox and Battey's choices of sparse non-zero ̂k would also need to be combined into a scalar test statistic. One idea would be to use a more general test statistic: where k ,m represents the kth coefficient for a model m that has essentially the same fit as the Lasso procedure.

Limitations and strengths
Our example certainly has limitations. Our target experiment relying on seventeen human subjects with two outcome measurements as 34 units has low power because of the small number of units, even with no carry-over and no time effects between visits, assumptions which may not be plausible in our experiment. The literature is limited on both issues. Because the order of exposure is randomized, we expect important covariates (e.g., ambient air pollution exposure, dietary changes) to be balanced between groups, but this is not guaranteed. Regarding a potential carryover effect of ozone on the DNA methylome, we believe this should be studied in animal and/or in vitro studies before conducting such crossover experiments with human participants. If this were the case, one should take into account the sequential feature of the crossover experiments and allow for the estimation of the carry-over effects, for example, using a model to estimate appropriate causal effects. However, for this effort to be helpful would require a larger number of participants than in this study. Despite these limitations, however, we believe that this paper possesses some important messages. Because of the randomized treatment assignment induced by the study design, it is conceptually straightforward to implement a randomizationbased test, and thanks to currently available high-performance computing, we can do this exactly. Also, if the "no carry-over and no time effects" assumptions hold, it would be natural to report that we found some weak evidence about an effect of ozone exposure on the DNA methylome.

Computational considerations
For a single exposure allocation, the Elastic-Net procedure takes 100 seconds on a 2020 personal laptop, which would mean about 61 days for all 2 17 allocations. Instead, we capitalized on parallel computing (i.e., 100 allocations per batch) available on the 2020 Harvard FAS cluster to construct the null randomization of the test statistic.

Epilogue
Future directions on this topic include the extension of this procedure to non-randomized experiments. In this case, we would have to posit an assignment mechanism (Bind et al. 2014) and a class for sensitivity analysis ; assuming the unconfoundedness assumption holds, the rest of the procedure is straightforward.
An alternative to the sharp null hypothesis testing is to use Bayesian model-based inference and explicitly impute the missing high-dimensional potential outcomes (Rubin 1978). The first step could be to model each potential outcome under clean air as mixtures of a finite number of distributions, and similarly for each potential outcome under ozone.
Here, using a significance level of 0.05, we would fail to reject the sharp null hypotheses stating that the effect of ozone vs. clean air on all 484,531 components of the epigenomic outcome is zero, but would not accept this sharp null hypothesis. We believe that it would be interesting to contemplate the multivariate counternull values (Rosenthal and Rubin 1994) (or sets) that, by definition, have the same randomization-based evidence as the null values we obtained here; this is a computational challenge that we hope to address.