Abstract
Testing whether a variable of interest affects the outcome is one of the most fundamental problems in statistics and is often the main scientific question of interest. To tackle this problem, the conditional randomization test (CRT) is widely used to test the independence of variable(s) of interest (X) with an outcome (Y) holding other variable(s) (Z) fixed. The CRT uses the "Model-X" inference framework, which relies solely on the iid sampling of (X, Z) to produce exact finite-sample p values constructed from any test statistic. We propose a new method, the adaptive randomization test (ART), that tackles the same independence problem while allowing the data to be sampled adaptively. Like the CRT, the ART relies solely on knowing the (adaptive) sampling distribution of (X, Z). Although the ART allows practitioners to flexibly design and analyze adaptive experiments, the method itself does not guarantee a powerful adaptive sampling procedure. For this reason, we demonstrate substantial power gains from adaptive sampling over the typical iid sampling procedure in a multi-arm bandit setting and in an application to conjoint analysis. We believe the proposed adaptive procedure is successful because it stabilizes arms that may initially look like "fake" signals due to random chance toward their "null" behavior, and samples more from signal arms and less from null arms.
Notes
Neither the CRT nor our paper needs to assume the existence of the pdf. However, for clarity and ease of exposition, we present the data generating distribution with respect to a pdf.
We remind the reader that p is used to denote the cardinality of \(\mathcal {X}\) as opposed to the dimension of \(\mathcal {X}\).
\(q^{\star }\) is not formally the optimal iid sampling procedure over all possible iid sampling procedures, since we consider the maximum power when varying only \(q_1\) while forcing the remaining arms to have equal probabilities. However, we do not expect any other reasonable iid sampling procedure to have greater power than \(q^{\star }\), since the remaining \(p-1\) arms with no signal are not distinguishable in any way; thus, we lose no generality by assigning them equal probability.
References
Arrow KJ (1998) What has economics to say about racial discrimination? J Econ Perspect 12(2):91–100. https://doi.org/10.1257/jep.12.2.91
Ash RB, Doleans-Dade CA (1999) Probability and measure theory, 2nd edn. Harcourt/Academic Press, Burlington, MA
Bates S, Sesia M, Sabatti C, Candès E (2020) Causal inference in genetic trio studies. Proc Natl Acad Sci 117(39):24117–24126. https://doi.org/10.1073/pnas.2007743117
Benjamini Y, Hochberg Y (1995) Controlling the false discovery rate: a practical and powerful approach to multiple testing. J R Stat Soc B 57(1):289–300
Berrett T, Wang Y, Barber R, Samworth R (2019) The conditional permutation test for independence while controlling for confounders. J R Stat Soc: Ser B (Stat Methodol). https://doi.org/10.1111/rssb.12340
Bojinov I, Shephard N (2019) Time series experiments and causal estimands: exact randomization tests and trading. J Am Stat Assoc
Candès E, Fan Y, Janson L, Lv J (2018) Panning for gold: Model-X knockoffs for high-dimensional controlled variable selection. J Roy Stat Soc B 80(3):551–577
Farronato C, MacCormack A, Mehta S (2018) Innovation at Uber: the launch of express pool. Harvard Business School Case
Glynn P, Johari R, Rasouli M (2020) Adaptive experimental design with temporal interference: a maximum likelihood approach. https://doi.org/10.48550/ARXIV.2006.05591
Hainmueller J, Hopkins DJ (2015) The hidden American immigration consensus: a conjoint analysis of attitudes toward immigrants. Am J Polit Sci. https://doi.org/10.1111/ajps.12138
Ham DW, Imai K, Janson L (2022) Using machine learning to test causal hypotheses in conjoint analysis. https://doi.org/10.48550/arXiv.2201.08343
Imbens GW, Rubin DB (2015) Causal inference for statistics, social, and biomedical sciences: an introduction. Cambridge University Press, Cambridge
James W, Stein C (1961) Estimation with quadratic loss. In: Proceedings of the Fourth Berkeley Symposium on Mathematical Statistics and Probability, Volume 1: Contributions to the Theory of Statistics. Univ. of California Press, Berkeley, CA
Lai TL, Robbins H (1985) Asymptotically efficient adaptive allocation rules. Adv Appl Math 6(1):4–22. https://doi.org/10.1016/0196-8858(85)90002-8
Le Cam L (1956) On the asymptotic theory of estimation and testing hypotheses. In: Proceedings of the Third Berkeley Symposium on Mathematical Statistics and Probability, Volume 1: Contributions to the Theory of Statistics, pp 129–156. University of California Press
Liu JS (2008) Monte Carlo strategies in scientific computing. Springer, New York, Berlin, Heidelberg, p 344
Luce RD, Tukey JW (1964) Simultaneous conjoint measurement: a new type of fundamental measurement. J Math Psychol 1(1):1–27. https://doi.org/10.1016/0022-2496(64)90015-X
Lupia A, Mccubbins M (2000) The democratic dilemma: Can citizens learn what they need to know? Am Polit Sci Rev. https://doi.org/10.2307/2586046
Offer-Westort M, Coppock A, Green DP (2021) Adaptive experimental design: prospects and applications in political science. Am J Polit Sci 65(4):826–844. https://doi.org/10.1111/ajps.12597
Ono Y (2018) Replication data for: The contingent effects of candidate sex on voter choice. https://doi.org/10.7910/DVN/IZKZET
Ono Y, Burden BC (2018) The contingent effects of candidate sex on voter choice. Polit Behav 1–25
Rosenberger WF, Uschner D, Wang Y (2019) Randomization: the forgotten component of the randomized clinical trial. Stat Med 38(1):1–12. https://doi.org/10.1002/sim.7901
Shi C, Xiaoyu W, Luo S, Zhu H, Ye J, Song R (2022) Dynamic causal effects evaluation in a/b testing with a reinforcement learning framework. J Am Stat Assoc. https://doi.org/10.1080/01621459.2022.2027776
Skarnes W, Rosen B, West A, Koutsourakis M, Roake W, Iyer V, Mujica A, Thomas M, Harrow J, Cox T, Jackson D, Severin J, Biggs P, Fu J, Nefedov M, de Jong P, Stewart A, Bradley A (2011) A conditional knockout resource for the genome-wide study of mouse gene function. Nature 474:337–42. https://doi.org/10.1038/nature10163
Sutton RS, Barto AG (2018) Reinforcement learning: an introduction. A Bradford Book, MIT press, Cambridge, MA
Thompson WR (1933) On the likelihood that one unknown probability exceeds another in view of the evidence of two samples. Biometrika 25(3–4):285–294. https://doi.org/10.1093/biomet/25.3-4.285
Tibshirani R (1996) Regression shrinkage and selection via the lasso. J R Stat Soc Ser B (Methodological) 58(1):267–288
Wainwright MJ, Jordan MI (2008) Graphical models, exponential families, and variational inference. Found Trends Mach Learn 1(1–2):1–305
Wu J, Ding P (2021) Randomization tests for weak null hypotheses in randomized experiments. J Am Stat Assoc 116(536):1898–1913. https://doi.org/10.1080/01621459.2020.1750415
Acknowledgements
We thank Lucas Janson, Iavor Bojinov and Subhabrata Sen for advice and feedback.
Appendices
Appendix A: Asymptotic results for the normal-means model
The two asymptotic power analysis results omitted from Sect. 3.2 for conciseness of the main text are stated here. Proofs of all results presented in this section are in Appendix E.
Theorem A.1
(Normal-Means Model: Power of RT under iid sampling procedures) Upon taking \(B \rightarrow \infty \), the asymptotic power of the iid sampling procedure with probability weight vector \(q = (q_1,q_2,\cdots ,q_p)\), as defined in Definition 3.1, with respect to the RT with the "maximum" test statistic, is equal to
where \(z_{1 - \alpha }\) is the \(1 - \alpha \) quantile of the distribution of \(\tilde{T}_{\text {iid}}\). \(T_{\text {iid}}\) and \( \tilde{T}_{\text {iid}}\) are defined/generated as a function of \(G:= (G_1, G_2, \dots , G_{p-1})\) and \(H:= (H_1, H_2, \dots , H_{p-1})\), both of which are independent and follow the same \((p-1)\)-dimensional multivariate Gaussian distribution \(\mathcal {N} \left( 0, \Sigma (q) \right) \). \(T_{\text {iid}}\) and \(\tilde{T}_{\text {iid}}\) are then defined as
Finally, \(\Sigma (q)\) is specified by
where matrices \(\Sigma _0\) and D are defined as
with \(v(x) = x(1-x)\), and \(D(q):= \text {diag}(q_1,q_2,\dots ,q_{p-1}) \in \mathbb {R}^{(p-1)\times (p-1)}\).
We note that if we assume p to be "large" (in a generic sense) and our sampling probabilities satisfy \(q_j = O(1/p)\) for all j, then the diagonal elements of \(\Sigma (q)\) will generally be much larger than the off-diagonal elements. Consequently, G and H in Theorem A.1 will have approximately independent coordinates; thus, both \(T_{\text {iid}}\) and \(\tilde{T}_{\text {iid}}\) are characterized by nearly independent Gaussian distributions.
By an argument similar to the proof of Theorem A.1, we can also derive the asymptotic power of our two-stage adaptive sampling procedures.
Theorem A.2
(Normal-Means Model: Power of the ART under two-stage adaptive sampling procedures) Upon taking \(B \rightarrow \infty \), the asymptotic power of a two-stage adaptive sampling procedure with exploration parameter \(\epsilon \), reweighting function f, scaling parameter t, and test statistic T, as defined in Definition 3.2, with respect to the ART with the "maximum" test statistic, is equal to
where \(z_{1-\alpha }(\tilde{T}_{\text {adap},j}\mid R^{\text {F}},R^{\text {S}}, H^{\text {F}}, H^{\text {S}})\) denotes the \(1-\alpha \) quantile of the conditional distribution of \(\tilde{T}_{\text {adap}}\) given \( R^{\text {F}}\), \(R^{\text {S}}\), \( H^{\text {F}}\) and \( H^{\text {S}}\). Furthermore,
where \(R^{\text {F}}\), \(R^{\text {S}}\), \(G^{\text {F}}\), \(G^{\text {S}}\), \(H^{\text {F}}\), \(H^{\text {S}}\), Q, \(\tilde{Q}\), W and \(\tilde{W}\) are random quantities generated from the following procedure. First, generate \(R^{\text {F}} \sim \mathcal {N}(0,1)\), \(G^{\text {F}} \sim \mathcal {N}\left( 0, \Sigma (q) \right) \), and \(H^{\text {F}} \sim \mathcal {N}\left( 0, \Sigma (q) \right) \) independently, where \(\Sigma (\cdot )\) is defined in Eq. A1. Second, compute
Third, compute
We note that with a slight abuse of notation, the Q defined here is the asymptotic distributional characterization of Eq. 6. Lastly, generate \(R^{\text {S}}\sim \mathcal {N}(0,1)\), \(H^{\text {S}} \sim \mathcal {N} \left( 0, \Sigma (Q)\right) \) and \(G^{\text {S}} \sim \mathcal {N} \left( 0,\Sigma \left( \tilde{Q} \right) \right) \) independently.
Appendix B: Multiple testing
Our proposed method tests \(H_0\) for a single variable of interest X conditional on other experimental variables Z. However, the practitioner may be interested in testing multiple \(H_0\) for multiple variables of interest (including variables from Z).
To formalize this, let \(X = (X^1, X^2, \dots , X^{p})\) contain p variables of interest, each of which can also be multidimensional. Informally speaking, our objective is to perform p tests of \(Y \perp \!\!\! \perp X^{j} \mid X^{-j}\) for \(j = 1, 2, \dots, p\), where \(X^{-j}\) denotes all variables in X except \(X^j\). For a fixed j, our proposed methodology in Sects. 2.1–2.3 can be used to test any single one of these hypotheses. The main issue with directly extending our methodology to test all \(j = 1, 2, \dots, p\) variables is that, when testing a single hypothesis \(Y \perp \!\!\! \perp X^{j} \mid X^{-j}\), Assumption 1 does not allow \(X^{-j}\) to depend on previous \(X^{j}\), although \(X^{j}\) may depend on previous \(X^{-j}\). This asymmetry may cause the assumption to hold when testing for \(X^j\) but simultaneously fail when testing for \(X^{j'}\) with \(j\ne j'\). Thus, in order to satisfy Assumption 1 for all variables of interest simultaneously, we modify our procedure so that each \(X_t^j\) is independent of \(X_{t'}^{j'}\) for all \(j \ne j'\) and \(t' \le t\). In other words, we force each \(X_t^j\) to be sampled according to its own history \(X_{1:(t-1)}^j\) and the history of the response, but not the history or current values of \(X^{j'}\) for \(j' \ne j\). We formalize this in the following assumption.
Assumption 2
(Each \(X^j\) does not adapt to other \(X^{j'}\)) For each \(t = 1, 2, \dots , n\), suppose \(X_t = (X_t^1, X_t^2, \dots , X_t^p)\) is sampled according to a sequential adaptive sampling procedure A: \(X_t \sim f_{t}^A(x_t^1, x_t^2, \dots , x_t^p \mid x_{1:(t-1)}, y_{1:(t-1)})\). We say an adaptive procedure A satisfies Assumption 2 if \( f_{t}^A \) can be written in the following factorized form, for \(t = 2, 3, \dots , n\),
with every \(f_{t,j}^A(\cdot \mid x^j_{1:(t-1)}, y_{1:(t-1)})\) being a valid probability measure for all possible values of \((x^j_{1:(t-1)}, y_{1:(t-1)})\).
Assumption 2 states that \(X^{j}\) cannot adapt based on the history of any other \(X^{j'}\), \(j' \ne j\). This assumption is sufficient for Assumption 1 to hold when testing \(H_0\) for every \(X^j\), \(j = 1, 2, \dots , p\), thus leading to a valid p value for every \(X^j\) simultaneously when using the proposed ART procedure in Algorithm 1. Although our framework gives valid p values for each of the multiple tests, we still need to account for multiple testing. For example, one simple way to control the false discovery rate is to use the Benjamini–Hochberg procedure (Benjamini and Hochberg 1995), though multiple testing corrections are not the focus of our paper.
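For concreteness, the Benjamini–Hochberg step-up procedure can be sketched as follows; the p values below are hypothetical placeholders, one per tested variable \(X^j\):

```python
import numpy as np

def benjamini_hochberg(pvals, alpha=0.05):
    """Benjamini-Hochberg step-up procedure.

    Returns a boolean array marking which hypotheses are rejected
    while controlling the false discovery rate at level `alpha`.
    """
    p = np.asarray(pvals, dtype=float)
    m = len(p)
    order = np.argsort(p)                          # sort p values ascending
    thresholds = alpha * (np.arange(1, m + 1) / m)  # BH step-up thresholds
    below = p[order] <= thresholds
    rejected = np.zeros(m, dtype=bool)
    if below.any():
        k = np.max(np.nonzero(below)[0])           # largest index passing its threshold
        rejected[order[: k + 1]] = True            # reject it and all smaller p values
    return rejected

# Hypothetical per-variable ART p values.
pvals = [0.001, 0.012, 0.049, 0.20, 0.74]
print(benjamini_hochberg(pvals, alpha=0.05))       # rejects the first two hypotheses
```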
Appendix C: Discussion of the natural adaptive resampling procedure
Keen readers may argue that the NARP is merely a practical but unnecessary choice, and that Assumption 1 could therefore be dropped. Exchangeability requires \((\textbf{X}, \textbf{Z}, \textbf{Y})\) and \((\tilde{\textbf{X}}^b, \textbf{Z}, \textbf{Y})\) to be equal in distribution. Consequently, if one could sample the entire data vector \(\tilde{\textbf{X}}\) from the conditional distribution of \(\textbf{X} \mid (\textbf{Z}, \textbf{Y})\), this construction of \(\tilde{\textbf{X}}\) would satisfy the required distributional equality. In general, however, it is well known that sampling from a complicated graphical model is difficult (Wainwright et al. 2008). To illustrate this, we show with the following equations how constructing valid resamples \(\tilde{\textbf{X}}^b\) may be difficult without Assumption 1, even for just two time periods:
This follows directly from elementary probability calculations. Since any valid construction of \(\tilde{\textbf{X}}^b\) must satisfy \(P(\tilde{X}_1 = x_1, \tilde{X}_2 = x_2 \mid Z_1 = z_1, Z_2 = z_2, Y_1 = y_1, Y_2 = y_2) = P(X_1 = x_1, X_2 = x_2 \mid Z_1 = z_1, Z_2 = z_2, Y_1 = y_1, Y_2 = y_2)\), the above equation shows that it is generally hard to construct valid resamples because of the normalizing constant in the denominator of the second line. We further note that Assumption 1 bypasses this problem because, under it, \( P(Z_2 = z_2 \mid X_1=x_1, Y_1 = y_1, Z_1 = z_1) \) does not depend on the value \( x_1 \). Therefore, the denominator in the second line equals \( P(Z_2 = z_2 \mid X_1 = x_1, Y_1 = y_1, Z_1 = z_1)\) for every \(x_1\), canceling with the numerator.
Although sampling from a distribution known up to a proportionality constant has been extensively studied in the Markov chain Monte Carlo (MCMC) literature (Liu 2008), many MCMC methods add computational burden to an already expensive algorithm that requires \(B + 1\) resamples and computations of the test statistic T. Moreover, it is unclear how "approximate" draws from the desired distribution in an MCMC algorithm may affect the exact validity of the p values. This problem may be exacerbated when the sample size n is large, because the errors in each resample could accumulate exponentially across time. We therefore adopt the NARP along with Assumption 1, as it avoids these complications.
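To make the contrast concrete, under Assumption 1 the NARP simply redraws \(\tilde{X}_t\) sequentially from the known sampling rule, feeding it the resampled \(\tilde{X}\) history and the observed responses. A toy sketch without Z, with an illustrative adaptive rule (the rule itself is a made-up example, not the paper's):

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_rule(x_hist, y_hist, p=3):
    """A known adaptive sampling rule (toy example): start uniform,
    then upweight arms whose running mean response is larger."""
    if len(x_hist) == 0:
        return np.full(p, 1.0 / p)
    w = np.ones(p)
    for j in range(p):
        ys = [y for x, y in zip(x_hist, y_hist) if x == j]
        if ys:
            w[j] += max(np.mean(ys), 0.0)
    return w / w.sum()

def narp_resample(x_obs, y_obs, p=3, rng=rng):
    """Sequentially redraw each X_t from the SAME known rule, conditioning
    on the resampled X history and the OBSERVED responses -- the NARP
    construction, sketched here without Z."""
    x_res = []
    for t in range(len(x_obs)):
        probs = sample_rule(x_res, y_obs[:t], p=p)
        x_res.append(int(rng.choice(p, p=probs)))
    return x_res

# Toy observed data: 5 rounds over 3 arms.
x_obs = [0, 1, 2, 0, 1]
y_obs = [0.5, -0.2, 0.1, 0.8, 0.0]
print(narp_resample(x_obs, y_obs))
```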
Appendix D: Proof of main results presented in Sect. 2
Proof of Theorem 2.3
By definition of our resampling procedure, under \(\text {H}_0\),
where the last “\(\overset{\text {d}}{=}\) ” is by the null hypothesis of conditional independence, namely \(X_1 \perp \!\!\! \perp Y_1 \mid Z_1\). Moreover, it also suggests
Then, we will prove the following statement holds for any \(k \in \{ 1,2,\dots ,n\}\) by induction,
Assuming Eq. D3 holds for \(k-1\), we now prove it also holds for k. For simplicity, in the rest of this proof, we will use \(P(\cdot )\) as a generic notation for pdf or pmf, though the proof holds for more general distributions without a pdf or pmf. First,
where (i) is simply by Bayes rule; (ii) is because \(Z_k \perp \!\!\! \perp \tilde{X}_{1:{k-1}} \mid (Y_{1:(k-1)}, Z_{1:(k-1)})\), since \(\tilde{X}_{1:{k-1}}\) is a random function of only \(Y_{1:(k-1)}\) and \(Z_{1:(k-1)}\); (iii) is by the induction hypothesis; and lastly, (iv) is by Assumption 1. Moreover,
where (i) is again simply by Bayes rule; (ii) is because \(Y_k\) is a random function of only \(Z_k\) (up to time k) under the null \(H_0\) and thus is independent of anything with index smaller or equal to k conditioning on \(Z_k\); (iii) is again by Bayes rule; (iv) is by Definition 2.2; and finally (v) is by the previous equation above. Equation D3 is thus established by induction, as a corollary of which, we also get for any \(k \le n\),
Finally, note that \(\tilde{X} \perp \!\!\! \perp X \mid (Y,Z)\). So, conditioning on (Y, Z), \(\tilde{X}\) and X are exchangeable, which means the p value defined in Equation 5 is conditionally valid, conditioning on (Y, Z). Since \(\mathbb {P}\left( p < \alpha \mid Y, Z \right) \le \alpha \) holds conditionally, it also holds marginally. \(\square \)
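As a sanity check on this exchangeability argument, the following sketch computes the randomization p value \(p = (1 + \#\{b: T^b \ge T\})/(B+1)\), as in Eq. 5, in a toy iid setting under \(H_0\) and confirms the rejection rate stays near or below \(\alpha\); the statistic and sampling rule are illustrative choices, not the paper's exact setup:

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 50, 3

def resample_x(rng):
    """Draw a fresh X vector from the known (here: iid uniform) sampling rule."""
    return rng.integers(0, p, size=n)

def stat(x, y):
    """A 'maximum'-style statistic: largest absolute arm mean."""
    return max(abs(y[x == j].mean()) if (x == j).any() else 0.0 for j in range(p))

def randomization_p_value(x, y, B=99):
    """p = (1 + #{b : T(x_tilde^b, y) >= T(x, y)}) / (B + 1)."""
    t_obs = stat(x, y)
    t_null = np.array([stat(resample_x(rng), y) for _ in range(B)])
    return (1 + np.sum(t_null >= t_obs)) / (B + 1)

reps, rejections = 200, 0
for _ in range(reps):
    x = resample_x(rng)
    y = rng.normal(size=n)                 # H0: Y independent of X
    if randomization_p_value(x, y) <= 0.05:
        rejections += 1
print(rejections / reps)                   # stays near or below alpha = 0.05
```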
Proof of Theorem 2.4
Note that Assumption 1 was only utilized once in the proof of Theorem 2.3, namely (iv) of Eq. D4. So upon assuming \((\tilde{X}_{1:k},Y_{1:k}, Z_{1:k}) \overset{\text {d}}{=} (X_{1:k},Y_{1:k}, Z_{1:k})\), we know immediately from Eq. D4 that
which is exactly Assumption 1. \(\square \)
Appendix E: Proof of results presented in Appendix A
Before proving the main power results, we first state a lemma concerning the effect of taking B to infinity, which justifies assuming B to be large enough and ignoring the discreteness of p values like the one defined in Eq. 5. Similar proof arguments are made in Wu and Ding (2021); thus, we omit the proof of this lemma. The lemma states that as \(B \rightarrow \infty \), conditioning on any given values of \((\textbf{X}, \textbf{Y},\textbf{Z})\),
Lemma E.1
(Power of the ART as \(B \rightarrow \infty \)) For any adaptive sampling procedure A satisfying Definition 2.1 and any test statistic T, as we take \(B \rightarrow \infty \), the asymptotic conditional power of the ART (with the CRT being a degenerate special case) conditional on (Y, Z) is equal to
while the unconditional (marginal) power is equal to
Note that the joint distribution of \((\textbf{X}, \tilde{\textbf{X}}, \textbf{Y}, \textbf{Z})\) is implicitly specified by the sampling procedure A.
Lemma E.2
(Normal-Means Model with iid sampling procedures: Joint Asymptotic Distributions of \(\bar{Y}_j\)’s, \(\tilde{\bar{Y}}_j\)’s and \(\bar{Y}\) Under the Alternative \(\text {H}_1\)) Define
Upon assuming the normal-means model introduced in Sect. 3, under the alternative \(\text {H}_1\) with \(h = h_0 / \sqrt{n}\), as \(n \rightarrow \infty \),
with
where \(G:= (G_1, G_2, \dots , G_{p-1})\) and \(H:= (H_1, H_2, \dots , H_{p-1})\) both follow the same \((p-1)\)-dimensional multivariate Gaussian distribution \(\mathcal {N} \left( 0, \Sigma \right) \) and R is a standard normal random variable. Note that \(\Sigma \) was defined in the statement of Theorem A.1. Moreover, G, H and R are independent.
Remark 1
Roughly speaking, after removing means, R captures the randomness of \(\textbf{Y}\) being sampled from its marginal distribution; H captures the randomness of sampling \(\textbf{X}\) conditioning on \(\textbf{Y}\); lastly, G captures the randomness of resampling \(\tilde{{\textbf {X}}}\) given \(\textbf{Y}\).
Remark 2
We also note that we do not characterize the distribution of \(\tilde{\bar{Y}}_{p}\) or \(\bar{Y}_{p}\), to avoid stating the convergence in terms of a degenerate multivariate Gaussian distribution, since \(\bar{Y}_p\) is a deterministic function of \(\bar{Y}\) and the remaining \(p-1\) arm means.
Proof of Lemma E.2
We first characterize the conditional distribution of \(\tilde{\bar{Y}}_j\). For any \(j \in \{1,2,\dots , p \}\),
By central limit theorem, since \(\text {Var} \left( Y_i (\textbf{1}_{\tilde{X}_i = j} - q_j ) \right) \rightarrow q_j (1 - q_j)\) as \(n \rightarrow \infty \),
which together with Slutsky’s Theorem and the fact that \(q_j n / \sum _{i=1}^n \textbf{1}_{\tilde{X}_i = j} \rightarrow 1\) almost surely gives,
where \(v(q_j) = \text {Var}(\text {Bern}(q_j)) = \text {Var}(\textbf{1}_{\tilde{X}_i = j}) = q_j (1 - q_j)\). In addition to these one-dimensional asymptotic results, we can also derive the joint asymptotic distribution. Before moving forward, we define some useful notation,
and
By multivariate Lindeberg-Feller CLT (see for instance Ash and Doleans-Dade 1999),
which further gives
because of
Therefore, we have
where
with
Roughly speaking, this suggests that after removing the shared randomness induced by \(\frac{\sum _{i=1}^n Y_i}{\sqrt{n}}\), all the \(\sqrt{n} \tilde{\bar{Y}}_j\)’s are asymptotically independent and Gaussian-distributed.
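The limiting variance \(v(q_j) = q_j(1-q_j)\) appearing above can be checked numerically; a quick sketch under the null, where \(Y_i\) is standard normal and independent of the iid arm draws (the arm probabilities below are illustrative):

```python
import numpy as np

rng = np.random.default_rng(2)

q = np.array([0.2, 0.3, 0.5])   # iid sampling weights over p = 3 arms
j, n = 0, 200_000

y = rng.normal(size=n)                        # under H0, Y ~ N(0, 1), independent of X
x = rng.choice(len(q), size=n, p=q)           # iid arm draws with weights q
samples = y * ((x == j).astype(float) - q[j])  # the summand Y_i (1{X_i = j} - q_j)

# Var(Y (1{X=j} - q_j)) = E[Y^2] E[(1{X=j} - q_j)^2] = q_j (1 - q_j) = 0.16 here.
print(samples.var())
```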
Next, we turn to \(\bar{Y}_j\). Note that in this part we view \(X_i\) as generated from \(F_{X \mid Y}\) after \(Y_i\) is generated according to its marginal distribution. The only difference between the observed test statistic and the above is that we have
with \(q^{\star }_i = (q_{i,1}^{\star },q_{i,2}^{\star },\cdots , q_{i,p}^{\star } )\) and
instead. Again, multivariate Lindeberg-Feller CLT gives
with
Note that, since
and
we have
which further gives
Similar to \(\textbf{J}\)’s, we define \(\textbf{J}^{\star }\)’s as well,
and
which together with Eq. E10 gives
Note that although Eq. E7 and Eq. E11 are almost identical, this does not imply that the \(\bar{Y}_j\)'s and \(\tilde{\bar{Y}}_j\)'s have the same asymptotic distribution, since the "mean" parts that have been removed, namely \(\frac{\sum _{i=1}^n Y_i}{\sqrt{n}}\) and \(\frac{\sum _{i=1}^{n} q_{i,j}^{\star } Y_i}{ q_j \sqrt{n}}\), behave differently, as demonstrated in Lemmas E.3, E.4, E.5 and E.6. Roughly speaking, under this \(\sqrt{n}\) scaling, the randomness producing the Gaussian noise in the CLT is the same across them, as shown in Eq. E7 and Eq. E11, but the Gaussian distributions they converge to have different means.
Finally, following exactly the same logic, we can further derive the following joint asymptotic distribution of \(\textbf{J}_{-p,n}\), \(\textbf{J}_{-p,n}^{\star }\) and \(\frac{\sum _{i=1}^n Y_i}{\sqrt{n}}\). Letting
we have
\(\square \)
Lemma E.3
As \(n \rightarrow \infty \),
Proof
By defining \(E_i:= S_i W_i + (1 - S_i) G_i \sim \mathcal {N}(0,1)\), we have
Note that \(E_i\) and \(S_i\) are not independent. Thus,
since by Law of Large Numbers the last two terms will vanish asymptotically and the first term will converge to \(\mathbb {E}(E_i^2) = 1\). Moreover,
where the last line is obtained by applying CLT to the first term and LLN to the second term. \(\square \)
Lemma E.4
As \(n \rightarrow \infty \),
Proof
We first show
Recall that \(Y_i\) can be seen as a mixture of two normal distributions \(\mathcal {N}(0,1)\) and \(\mathcal {N} \left( \frac{h_0}{\sqrt{n}},1 \right) \) with weights \(1-q_1\) and \(q_1\). Thus, \(\mathbb {E} \left( \sqrt{n} q_{i,1}^{\star } Y_i \right) \) is equal to
Note that with a change of variable \(h = h_0/\sqrt{n}\),
Similarly,
Equation E12 is thereby established. Then, we compute \( \lim _{n \rightarrow \infty }\text {Var}(q_{i,1}^{\star }Y_i)\) using the same strategy.
Combining Eq. E12 and Eq. E13, the lemma is thus established by central limit theorem. \(\square \)
Following exactly the same logic, we have the following parallel lemma for \(j \ne 1\).
Lemma E.5
For \(j \ne 1\), as \(n \rightarrow \infty \),
Proof
We first show
Again, recall that \(Y_i\) can be seen as a mixture of two normal distributions \(\mathcal {N}(0,1)\) and \(\mathcal {N} \left( \frac{h_0}{\sqrt{n}},1 \right) \) with weights \(1-q_1\) and \(q_1\). Thus, \(\mathbb {E} \left( \sqrt{n} q_{i,j}^{\star } Y_i \right) \) is equal to
With a change of variable \(h = h_0/\sqrt{n}\), we have
Similarly,
Finally, we have \(\lim _{n \rightarrow \infty }\text {Var}(q_{i,j}^{\star }Y_i) = q_j^2\) as well, which by the CLT finishes the proof. \(\square \)
We can further write down their asymptotic joint distribution. We note that \(q_{i,j}^{\star } = \frac{q_j}{q_2} q_{i,2}^{\star }\) holds deterministically for \(j>2\); thus, it suffices to include only \(j = 1,2\) in the joint asymptotic distribution.
Lemma E.6
As \(n \rightarrow \infty \),
where
and \(\Sigma _3 \in \mathbb {R}^{3\times 3}\) is equal to
In other words, asymptotically these three random variables are completely linearly correlated.
Proof
By Lemma E.4, it suffices to show
which can be established by the following three displays. First,
secondly,
and finally
\(\square \)
Appendix F: Additional simulations in normal-means model
To show that the results presented in Sect. 3.3 are not sensitive to the initially chosen adaptive parameters, and to further optimize over multiple adaptive procedures A as allowed by Algorithm 1, we create Fig. 5, which shows the power of the ART under different combinations of the adaptive parameters, the exploration parameter \(\epsilon \) and the reweighting value \(t_0\), in three different scenarios of p and \(h_0\).
Figure 5 shows that an adaptive procedure with exploration parameter \(\epsilon = 0.7\) is a favorable choice across different signal strengths. Additionally, we find that the optimal reweighting parameter t can differ across scenarios but does not seem to matter much. Our initially chosen parameter \(\epsilon = 0.5\) in Sect. 3.3 was thus not necessarily optimal, which further demonstrates the robustness of the results presented in Sect. 3.3.
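As a rough illustration of how the exploration fraction and the reweighting scale interact, the following toy two-stage allocation in a normal-means setting (uniform exploration for the first \([n\epsilon ]\) rounds, then exponential reweighting by stage-one arm means) shows the signal arm being sampled most. This is a schematic stand-in, not the exact procedure of Definition 3.2:

```python
import numpy as np

rng = np.random.default_rng(3)

def two_stage_counts(n=2000, p=10, h=0.5, epsilon=0.7, t0=2.0, rng=rng):
    """Stage 1: uniform sampling for the first [n * epsilon] rounds.
    Stage 2: sampling weights proportional to exp(t0 * stage-one arm mean)."""
    means = np.array([h] + [0.0] * (p - 1))   # only arm 0 carries signal
    n1 = int(n * epsilon)
    x1 = rng.integers(0, p, size=n1)          # uniform exploration draws
    y1 = rng.normal(means[x1], 1.0)
    arm_means = np.array([y1[x1 == j].mean() if (x1 == j).any() else 0.0
                          for j in range(p)])
    w = np.exp(t0 * arm_means)                # reweighting, scaled by t0
    w /= w.sum()
    x2 = rng.choice(p, size=n - n1, p=w)      # adaptive stage-two draws
    return np.bincount(np.concatenate([x1, x2]), minlength=p)

counts = two_stage_counts()
print(counts)   # the signal arm (index 0) is typically sampled the most
```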
Appendix G: Details of ART in conjoint analysis
In this section, we give further details of the ART used in Sect. 4, along with some additional simulation results.
1.1 G.1 Simulation setup
For our simulation setup, X and Z each contain one factor with four levels, i.e., \(X_t^L, X_t^R, Z_t^L, Z_t^R\) take values in \(\{1, 2, 3, 4\}\). The response model follows a logistic regression model with main effects and an interaction on only one specific combination,
where the first four indicators give main effects \(\beta _X, \beta _Z\) of X and Z, respectively, on the first level of each factor, and the last two indicators give an interaction effect \(\beta _{XZ}\) between the first level of factor X and the second level of factor Z. For example, \(\textbf{1}\{X_t^L = 1, Z_t^L = 2, (X_t^R, Z_t^R) \ne (1, 2) \}\) is one if the left profile values of (X, Z) are (1, 2) but the right profile values of (X, Z) are not (1, 2) simultaneously. In particular, this interaction indicator is still one if \((X_t^L, Z_t^L) = (1,2)\) and \((X_t^R, Z_t^R) = (1, 3)\). For the left plot of Fig. 6, \(\beta _X = \beta _Z = 0.6\) and \(\beta _{XZ} = 0.9\) while we vary the sample size on the x-axis. For the right plot of Fig. 6, the interaction \(\beta _{XZ} = 0\) while we vary \(\beta _X = \beta _Z \in \{0, 0.3, 0.6, 0.9, 1.2\}\) on the x-axis with a fixed sample size of \(n = 1{,}000\). Lastly, our response model assumes "no profile order effect" since all main and interaction effects are repeated symmetrically for the left and right profiles (with the sign flipped because \(Y = 1\) indicates that the left profile is selected).
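A minimal sketch of sampling from this response model, under our reading of the linear predictor described above (the signs and the exact form of the interaction indicator are our reconstruction):

```python
import numpy as np

rng = np.random.default_rng(4)

def sample_conjoint(n=1000, bx=0.6, bz=0.6, bxz=0.9, rng=rng):
    """Draw (X, Z, Y) from a logistic response model with symmetric left/right
    main effects on level 1 and one interaction on (X, Z) = (1, 2)."""
    XL, XR = rng.integers(1, 5, size=n), rng.integers(1, 5, size=n)
    ZL, ZR = rng.integers(1, 5, size=n), rng.integers(1, 5, size=n)
    eta = (bx * ((XL == 1).astype(float) - (XR == 1).astype(float))
           + bz * ((ZL == 1).astype(float) - (ZR == 1).astype(float))
           + bxz * (((XL == 1) & (ZL == 2) & ~((XR == 1) & (ZR == 2))).astype(float)
                    - ((XR == 1) & (ZR == 2) & ~((XL == 1) & (ZL == 2))).astype(float)))
    y = rng.random(n) < 1.0 / (1.0 + np.exp(-eta))   # Y = 1: left profile chosen
    return XL, XR, ZL, ZR, y.astype(int)

XL, XR, ZL, ZR, y = sample_conjoint()
print(y.mean())   # near 0.5 by the left/right symmetry
```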
1.2 G.2 Adaptive procedure and test statistic
We first give a detailed description of our adaptive procedure and then formally define the test statistic used in Sect. 4.
We define \(X_t \sim \text {Multinomial}(p_{t, 1}^X, p_{t, 2}^X, \dots , p_{t, K^2}^X)\), where \(p_{t, j}^X\) represents the probability of sampling the jth arm (an arm refers to each unique combination of left and right factor levels) out of \(K^2\) possible arms, and K is the total number of levels of X. For example, in our simulation setup \(K = 4\), so there are 16 possible arms: (1, 1), (1, 2), etc.; \(p_{t,j}^Z\) is defined similarly. The uniform iid sampling procedure pulls each arm with equal probability, i.e., \(p_{t,j}^X = \frac{1}{K^2}\) and \(p_{t,j}^Z = \frac{1}{L^2}\) for every j, where L is the total number of factor levels of Z. Although we present our adaptive procedure when Z contains only one other factor (typical conjoint analyses have 8–10 other factors), the procedure loses no generality in higher dimensions of Z.
We now propose the following adaptive procedure that adapts the sampling weights of \(p_{t, j}^X, p_{t, j}^Z\) at each time step t in the following way:
where \(\bar{Y}_{j,t}^X\) denotes the sample mean of \(Y_1, Y_2, \dots , Y_{t-1}\) for arm j of variable X, \(\bar{Y}_{j,t}^Z\) is defined similarly, and \(N(0, 0.01^2)\) denotes a Gaussian random variable with mean zero and variance \(0.01^2\) (the two Gaussians in Eq. (G14) are drawn independently). Equation (G14) samples more from arms that look like signals (means further away from 0.5). We add the slight perturbation in case \(\bar{Y}_{j,t}^X\) is exactly 0.5 at some time point t, so that no arm has zero probability of being sampled.
With this reweighting procedure, we build our adaptive procedure. As in Definition 3.2, we have an adaptive parameter \(\epsilon \): the first \([n\epsilon ]\) samples are used for "exploration" via the typical uniform iid sampling procedure. For the remaining samples, we adapt the weights according to Eq. (G14). This adaptive sampling procedure immediately satisfies Assumption 1, and also Assumption 2, since each variable only looks at its own history and previous responses. Algorithm 2 summarizes the adaptive procedure.
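A sketch of the reweighting step in the spirit of Eq. (G14), with the distance-from-0.5 scoring and the Gaussian perturbation described above (the exact functional form is our reading of the text, not its verbatim statement):

```python
import numpy as np

def adapt_weights(arm_means, rng):
    """Weight each arm by how far its running mean response is from 0.5,
    plus a small N(0, 0.01^2) perturbation so no arm's probability is
    driven to exactly zero."""
    arm_means = np.asarray(arm_means, dtype=float)
    scores = np.abs(arm_means - 0.5 + rng.normal(0.0, 0.01, size=arm_means.shape))
    return scores / scores.sum()   # normalize into sampling probabilities

rng = np.random.default_rng(5)
# Running means for 4 hypothetical arms; means far from 0.5 look like signal.
w = adapt_weights([0.5, 0.55, 0.8, 0.2], rng)
print(w)   # the last two arms (means 0.8 and 0.2) receive the most weight
```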
We now define the test statistic under consideration. Although Ham et al. (2022) considered a complex hierarchical Lasso model to capture all second-order interactions, given the simplicity of this simulation setting we consider a cross-validated Lasso logistic test statistic that regresses \(\textbf{Y}\) on the main effects of \(\textbf{X}\) and \(\textbf{Z}\) and their interactions. This leads to the following test statistic:
where \({\hat{\beta }}_k\) denotes the estimated main effect for level k of the K levels of X (one level is held as baseline) and \({\hat{\gamma }}_{kl}\) denotes the estimated interaction effect between level k of X and level l of the L levels of Z.
This test statistic also imposes the "no profile order effect" constraint, i.e., we do not separately estimate coefficients for the left and right profiles, which increases power. When fitting a Lasso logistic regression of \(\textbf{Y}\) on the main effects and interactions of \((\textbf{X}, \textbf{Z})\), we would obtain separate left and right effects. Since "no profile order effect" constrains the left and right effects to be equal, we formally impose the following constraints:
where the superscripts L and R denote the left and right profile effects, respectively. To incorporate this symmetry constraint, we split our original \(\mathbb {R}^{n \times (4 + 1)}\) data matrix \((\textbf{X},\textbf{Z}, \textbf{Y})\) into a new data matrix of dimension \(\mathbb {R}^{2n \times (2 + 1)}\), where the first n rows contain the values for the left profile (with the corresponding Y) and the next n rows contain the values for the right profile with new response \(1-Y\). Ham et al. (2022) show that this formally imposes the constraints in Eq. (16) by destroying any profile order information in the new data matrix.
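The matrix-splitting construction can be sketched directly; the helper name below is ours:

```python
import numpy as np

def impose_no_profile_order(XL, XR, ZL, ZR, y):
    """Stack left and right profiles so a single set of coefficients is fit:
    the first n rows are (X^L, Z^L) with response Y, the next n rows are
    (X^R, Z^R) with response 1 - Y, destroying profile-order information."""
    X = np.concatenate([XL, XR])
    Z = np.concatenate([ZL, ZR])
    resp = np.concatenate([y, 1 - y])
    return np.column_stack([X, Z, resp])   # shape (2n, 3), i.e. R^{2n x (2+1)}

# Tiny example with n = 2 respondents.
XL = np.array([1, 2]); XR = np.array([3, 4])
ZL = np.array([2, 1]); ZR = np.array([4, 3])
y = np.array([1, 0])
M = impose_no_profile_order(XL, XR, ZL, ZR, y)
print(M.shape)   # (4, 3)
```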
1.3 G.3 Simulation results
We first compare the power of our adaptive procedure stated in Algorithm 2 with the iid setting, where each arm for X and Z is drawn uniformly at random, under the simulation setting described in Appendix G.1. We empirically compute the power as the proportion of 1000 Monte Carlo p values less than \(\alpha = 0.05\).
For the left panel of Fig. 6, we increase the sample size when there exist both main effects and an interaction effect of X. More specifically, we vary the sample size \(n \in \{450, 600, 750, 1{,}000, 1{,}300\}\) while fixing the main effects of X and Z at 0.6 and a stronger interaction effect at 0.9 (these refer to the coefficients of the logistic response model defined in Appendix G.1). For the right panel of Fig. 6, we increase the main effects of X and Z with no interaction effect and a fixed sample size of \(n = 1{,}000\). We also vary the exploration parameter \(\epsilon \) in Algorithm 2 over \(\epsilon = 0.25, 0.5, 0.75\).
Both panels of Fig. 6 show that the power of the ART with the proposed adaptive sampling procedure is uniformly greater than that of the CRT with the typical uniform iid sampling procedure (green). For example, when \(n = 1{,}000\) in the left panel, there is a difference of 8.5 percentage points (59% versus 67.5%) between the iid sampling procedure and the adaptive sampling procedure with \(\epsilon = 0.5\) (red). When the main effect is as strong as 1.2 in the right panel, the difference is 24 percentage points (57% versus 81%). Additionally, when the main effect is 0 in the right panel, and thus \(H_0\) holds, all methods control the type 1 error as expected, with power near \(\alpha = 0.05\) (dotted black horizontal line).
Cite this article
Ham, D.W., Qiu, J. Hypothesis testing in adaptively sampled data: ART to maximize power beyond iid sampling. TEST 32, 998–1037 (2023). https://doi.org/10.1007/s11749-023-00861-2
Keywords
- Conditional independence testing
- Randomization inference
- Adaptive sampling
- Model-X
- Nonparametric testing