Abstract
Testing whether a variable of interest affects the outcome is one of the most fundamental problems in statistics and is often the main scientific question of interest. To tackle this problem, the conditional randomization test (CRT) is widely used to test the independence of variable(s) of interest (X) with an outcome (Y) holding other variable(s) (Z) fixed. The CRT uses the "Model-X" inference framework, which relies solely on the iid sampling of (X, Z) to produce exact finite-sample p values constructed from any test statistic. We propose a new method, the adaptive randomization test (ART), that tackles the same independence problem while allowing the data to be sampled adaptively. Like the CRT, the ART relies solely on knowing the (adaptive) sampling distribution of (X, Z). Although the ART allows practitioners to flexibly design and analyze adaptive experiments, the method itself does not guarantee a powerful adaptive sampling procedure. For this reason, we demonstrate substantial power gains from adaptive sampling over the typical iid sampling procedure in a multi-arm bandit setting and in an application to conjoint analysis. We believe the proposed adaptive procedure is successful because it stabilizes arms that may initially look like "fake" signals due to random chance toward their "null" behavior, and samples more from signal arms and less from null arms.
Notes
Neither the CRT nor our paper needs to assume the existence of the pdf. However, for clarity and ease of exposition, we present the data generating distribution with respect to a pdf.
We remind the reader that p is used to denote the cardinality of \(\mathcal {X}\) as opposed to the dimension of \(\mathcal {X}\).
\(q^{\star }\) is not formally the optimal iid sampling procedure over all possible iid sampling procedures, since we consider the maximum power when varying only \(q_1\) while forcing the remaining arms to have equal probabilities. However, we do not expect any other reasonable iid sampling procedure to have greater power than \(q^{\star }\), since the remaining \(p-1\) arms with no signal are not distinguishable in any way; thus, we lose no generality by assigning them equal probability.
References
Arrow KJ (1998) What has economics to say about racial discrimination? J Econ Perspect 12(2):91–100. https://doi.org/10.1257/jep.12.2.91
Ash RB, Doleans-Dade CA (1999) Probability and measure theory, 2nd edn. Harcourt/Academic Press, Burlington, MA
Bates S, Sesia M, Sabatti C, Candès E (2020) Causal inference in genetic trio studies. Proc Natl Acad Sci 117(39):24117–24126. https://doi.org/10.1073/pnas.2007743117
Benjamini Y, Hochberg Y (1995) Controlling the false discovery rate: a practical and powerful approach to multiple testing. J R Stat Soc B 57(1):289–300
Berrett T, Wang Y, Barber R, Samworth R (2019) The conditional permutation test for independence while controlling for confounders. J R Stat Soc: Ser B (Stat Methodol). https://doi.org/10.1111/rssb.12340
Bojinov I, Shephard N (2019) Time series experiments and causal estimands: exact randomization tests and trading. J Am Stat Assoc
Candès E, Fan Y, Janson L, Lv J (2018) Panning for gold: Model-X knockoffs for high-dimensional controlled variable selection. J Roy Stat Soc B 80(3):551–577
Farronato C, MacCormack A, Mehta S (2018) Innovation at Uber: the launch of express pool. Harvard Business School Case
Glynn P, Johari R, Rasouli M (2020) Adaptive experimental design with temporal interference: a maximum likelihood approach. https://doi.org/10.48550/ARXIV.2006.05591
Hainmueller J, Hopkins DJ (2015) The hidden American immigration consensus: a conjoint analysis of attitudes toward immigrants. Am J Polit Sci. https://doi.org/10.1111/ajps.12138
Ham DW, Imai K, Janson L (2022) Using machine learning to test causal hypotheses in conjoint analysis. https://doi.org/10.48550/arXiv.2201.08343
Imbens GW, Rubin DB (2015) Causal inference for statistics, social, and biomedical sciences: an introduction. Cambridge University Press, Cambridge
James W, Stein C (1961) Estimation with quadratic loss. In: Proceedings of the Fourth Berkeley Symposium on Mathematical Statistics and Probability, Volume 1: Contributions to the Theory of Statistics. Univ. of California Press, Berkeley, CA
Lai TL, Robbins H (1985) Asymptotically efficient adaptive allocation rules. Adv Appl Math 6(1):4–22. https://doi.org/10.1016/0196-8858(85)90002-8
Le Cam L (1956) On the asymptotic theory of estimation and testing hypotheses. In: Proceedings of the Third Berkeley Symposium on Mathematical Statistics and Probability, Volume 1: Contributions to the Theory of Statistics, pp 129–156. University of California Press
Liu JS (2008) Monte Carlo strategies in scientific computing. Springer, New York, Berlin, Heidelberg, p 344
Luce RD, Tukey JW (1964) Simultaneous conjoint measurement: a new type of fundamental measurement. J Math Psychol 1(1):1–27. https://doi.org/10.1016/0022-2496(64)90015-X
Lupia A, Mccubbins M (2000) The democratic dilemma: Can citizens learn what they need to know? Am Polit Sci Rev. https://doi.org/10.2307/2586046
Offer-Westort M, Coppock A, Green DP (2021) Adaptive experimental design: prospects and applications in political science. Am J Polit Sci 65(4):826–844. https://doi.org/10.1111/ajps.12597
Ono Y (2018) Replication data for: The contingent effects of candidate sex on voter choice. https://doi.org/10.7910/DVN/IZKZET
Ono Y, Burden BC (2018) The contingent effects of candidate sex on voter choice. Polit Behav 1–25
Rosenberger WF, Uschner D, Wang Y (2019) Randomization: the forgotten component of the randomized clinical trial. Stat Med 38(1):1–12. https://doi.org/10.1002/sim.7901
Shi C, Xiaoyu W, Luo S, Zhu H, Ye J, Song R (2022) Dynamic causal effects evaluation in a/b testing with a reinforcement learning framework. J Am Stat Assoc. https://doi.org/10.1080/01621459.2022.2027776
Skarnes W, Rosen B, West A, Koutsourakis M, Roake W, Iyer V, Mujica A, Thomas M, Harrow J, Cox T, Jackson D, Severin J, Biggs P, Fu J, Nefedov M, de Jong P, Stewart A, Bradley A (2011) A conditional knockout resource for the genome-wide study of mouse gene function. Nature 474:337–42. https://doi.org/10.1038/nature10163
Sutton RS, Barto AG (2018) Reinforcement learning: an introduction. A Bradford Book, MIT press, Cambridge, MA
Thompson WR (1933) On the likelihood that one unknown probability exceeds another in view of the evidence of two samples. Biometrika 25(3–4):285–294. https://doi.org/10.1093/biomet/25.3-4.285
Tibshirani R (1996) Regression shrinkage and selection via the lasso. J R Stat Soc Ser B (Methodological) 58(1):267–288
Wainwright MJ, Jordan MI (2008) Graphical models, exponential families, and variational inference. Found Trends Mach Learn 1(1–2):1–305
Wu J, Ding P (2021) Randomization tests for weak null hypotheses in randomized experiments. J Am Stat Assoc 116(536):1898–1913. https://doi.org/10.1080/01621459.2020.1750415
Acknowledgements
We thank Lucas Janson, Iavor Bojinov and Subhabrata Sen for advice and feedback.
Appendices
Appendix A: Asymptotic results for the normal-means model
The two asymptotic power analysis results omitted from Sect. 3.2 for conciseness of the main text are stated here. Proofs of all results presented in this section are in Appendix E.
Theorem A.1
(Normal-Means Model: Power of RT under iid sampling procedures) Upon taking \(B \rightarrow \infty \), the asymptotic power of the iid sampling procedure with probability weight vector \(q = (q_1,q_2,\cdots ,q_p)\), as defined in Definition 3.1, with respect to the RT with the "maximum" test statistic, is equal to
where \(z_{1 - \alpha }\) is the \(1 - \alpha \) quantile of the distribution of \(\tilde{T}_{\text {iid}}\). \(T_{\text {iid}}\) and \( \tilde{T}_{\text {iid}}\) are defined/generated as a function of \(G:= (G_1, G_2, \dots , G_{p-1})\) and \(H:= (H_1, H_2, \dots , H_{p-1})\), both of which are independent and follow the same \((p-1)\)-dimensional multivariate Gaussian distribution \(\mathcal {N} \left( 0, \Sigma (q) \right) \). \(T_{\text {iid}}\) and \(\tilde{T}_{\text {iid}}\) are then defined as
Finally, \(\Sigma (q)\) is specified by
where matrices \(\Sigma _0\) and D are defined as
with \(v(x) = x(1-x)\), and \(D(q):= \text {diag}(q_1,q_2,\dots ,q_{p-1}) \in \mathbb {R}^{(p-1)\times (p-1)}\).
We note that if we assume p to be "large" (in a generic sense) and our sampling probabilities satisfy \(q_j = O(1/p)\) for all j, then the diagonal elements of \(\Sigma (q)\) will generally be much larger than the off-diagonal elements. Consequently, G and H in Theorem A.1 will have approximately independent coordinates; thus, both \(T_{\text {iid}}\) and \(\tilde{T}_{\text {iid}}\) are characterized by nearly independent Gaussian distributions.
By an argument similar to the proof of Theorem A.1, we can also derive the asymptotic power of our two-stage adaptive sampling procedures.
Theorem A.2
(Normal-Means Model: Power of the ART under two-stage adaptive sampling procedures) Upon taking \(B \rightarrow \infty \), the asymptotic power of a two-stage adaptive sampling procedure with exploration parameter \(\epsilon \), reweighting function f, scaling parameter t, and test statistic T, as defined in Definition 3.2, with respect to the ART with the "maximum" test statistic, is equal to
where \(z_{1-\alpha }(\tilde{T}_{\text {adap},j}\mid R^{\text {F}},R^{\text {S}}, H^{\text {F}}, H^{\text {S}})\) denotes the \(1-\alpha \) quantile of the conditional distribution of \(\tilde{T}_{\text {adap}}\) given \( R^{\text {F}}\), \(R^{\text {S}}\), \( H^{\text {F}}\) and \( H^{\text {S}}\). Furthermore,
where \(R^{\text {F}}\), \(R^{\text {S}}\), \(G^{\text {F}}\), \(G^{\text {S}}\), \(H^{\text {F}}\), \(H^{\text {S}}\), Q, \(\tilde{Q}\), W and \(\tilde{W}\) are random quantities generated from the following procedure. First, generate \(R^{\text {F}} \sim \mathcal {N}(0,1)\), \(G^{\text {F}} \sim \mathcal {N}\left( 0, \Sigma (q) \right) \), and \(H^{\text {F}} \sim \mathcal {N}\left( 0, \Sigma (q) \right) \) independently, where \(\Sigma (\cdot )\) is defined in Eq. A1. Second, compute
Third, compute
We note that with a slight abuse of notation, the Q defined here is the asymptotic distributional characterization of Eq. 6. Lastly, generate \(R^{\text {S}}\sim \mathcal {N}(0,1)\), \(H^{\text {S}} \sim \mathcal {N} \left( 0, \Sigma (Q)\right) \) and \(G^{\text {S}} \sim \mathcal {N} \left( 0,\Sigma \left( \tilde{Q} \right) \right) \) independently.
Appendix B: Multiple testing
Our proposed method tests \(H_0\) for a single variable of interest X conditional on other experimental variables Z. However, the practitioner may be interested in testing multiple \(H_0\) for multiple variables of interest (including variables from Z).
To formalize this, let \(X = (X^1, X^2, \dots , X^{p})\) contain p variables of interest, each of which can also be multidimensional. Informally speaking, our objective is to perform p tests of \(Y \perp \!\!\! \perp X^{j} \mid X^{-j}\) for \(j = 1, 2, \dots, p\), where \(X^{-j}\) denotes all variables in X except \(X^j\). For a fixed j, our proposed methodology in Sects. 2.1–2.3 can be used to test any single one of these hypotheses. The main issue with directly extending our methodology to test all \(j = 1, 2, \dots, p\) variables is that, when testing a single hypothesis \(Y \perp \!\!\! \perp X^{j} \mid X^{-j}\), Assumption 1 does not allow \(X^{-j}\) to depend on previous \(X^{j}\), although \(X^{j}\) may depend on previous \(X^{-j}\). This asymmetry may cause the assumption to hold when testing for \(X^j\) but simultaneously fail when testing for \(X^{j'}\) with \(j\ne j'\). Thus, in order to satisfy Assumption 1 for all variables of interest simultaneously, we modify our procedure so that each \(X_t^j\) is independent of \(X_{t'}^{j'}\) for all \(j \ne j'\) and \(t' \le t\). In other words, we force each \(X_t^j\) to be sampled according to its own history \(X_{1:(t-1)}^j\) and the history of the response, but not the history or current values of \(X^{j'}\) for \(j' \ne j\). We formalize this in the following assumption.
Assumption 2
(Each \(X^j\) does not adapt to other \(X^{j'}\)) For each \(t = 1, 2, \dots , n\), suppose \(X_t = (X_t^1, X_t^2, \dots , X_t^p)\) is sampled according to a sequential adaptive sampling procedure A: \(X_t \sim f_{t}^A(x_t^1, x_t^2, \dots , x_t^p \mid x_{1:(t-1)}, y_{1:(t-1)})\). We say an adaptive procedure A satisfies Assumption 2 if \( f_{t}^A \) can be written in the following factorized form, for \(t = 2, 3, \dots , n\),
with every \(f_{t,j}^A(\cdot \mid x^j_{1:(t-1)}, y_{1:(t-1)})\) being a valid probability measure for all possible values of \((x^j_{1:(t-1)}, y_{1:(t-1)})\).
Assumption 2 states that \(X^{j}\) cannot adapt based on the history of any other \(X^{j'}\), \(j' \ne j\). This assumption is sufficient for Assumption 1 to hold when testing \(H_0\) for every \(X^j\), \(j = 1, 2, \dots , p\), thus leading to a valid p value for every \(X^j\) simultaneously when using the proposed ART procedure in Algorithm 1. Although our framework gives valid p values for each of the multiple tests, we still need to account for multiple testing. For example, one simple way to control the false discovery rate is to use the Benjamini–Hochberg procedure (Benjamini and Hochberg 1995), though multiple testing corrections are not the focus of our paper.
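For concreteness, the Benjamini–Hochberg step-up procedure can be sketched as follows; the p values below are hypothetical placeholders, one per tested variable \(X^j\):

```python
import numpy as np

def benjamini_hochberg(pvals, alpha=0.05):
    """Benjamini-Hochberg step-up procedure.

    Returns a boolean array marking which hypotheses are rejected
    while controlling the false discovery rate at level `alpha`.
    """
    p = np.asarray(pvals, dtype=float)
    m = len(p)
    order = np.argsort(p)                          # sort p values ascending
    thresholds = alpha * (np.arange(1, m + 1) / m)  # BH step-up thresholds
    below = p[order] <= thresholds
    rejected = np.zeros(m, dtype=bool)
    if below.any():
        k = np.max(np.nonzero(below)[0])           # largest index passing its threshold
        rejected[order[: k + 1]] = True            # reject it and all smaller p values
    return rejected

# Hypothetical per-variable ART p values.
pvals = [0.001, 0.012, 0.049, 0.20, 0.74]
print(benjamini_hochberg(pvals, alpha=0.05))       # rejects the first two hypotheses
```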
Appendix C: Discussion of the natural adaptive resampling procedure
Keen readers may argue that the NARP is merely a practical but unnecessary choice, and that Assumption 1 could therefore be dropped. Exchangeability requires \((\textbf{X}, \textbf{Z}, \textbf{Y})\) and \((\tilde{\textbf{X}}^b, \textbf{Z}, \textbf{Y})\) to be equal in distribution. Consequently, if one could sample the entire data vector \(\tilde{\textbf{X}}\) from the conditional distribution of \(\textbf{X} \mid (\textbf{Z}, \textbf{Y})\), this construction of \(\tilde{\textbf{X}}\) would satisfy the required distributional equality. In general, however, it is well known that sampling from a complicated graphical model is difficult (Wainwright et al. 2008). To illustrate this, we show with the following equations how constructing valid resamples \(\tilde{\textbf{X}}^b\) may be difficult without Assumption 1, even for just two time periods:
This follows directly from elementary probability calculations. Since any valid construction of \(\tilde{\textbf{X}}^b\) must satisfy \(P(\tilde{X}_1 = x_1, \tilde{X}_2 = x_2 \mid Z_1 = z_1, Z_2 = z_2, Y_1 = y_1, Y_2 = y_2) = P(X_1 = x_1, X_2 = x_2 \mid Z_1 = z_1, Z_2 = z_2, Y_1 = y_1, Y_2 = y_2)\), the above equation shows that it is generally hard to construct valid resamples because of the normalizing constant in the denominator of the second line. We further note that Assumption 1 bypasses this problem because, under it, \( P(Z_2 = z_2 \mid X_1=x_1, Y_1 = y_1, Z_1 = z_1) \) does not depend on the value \( x_1 \). Therefore, the denominator in the second line equals \( P(Z_2 = z_2 \mid X_1 = x_1, Y_1 = y_1, Z_1 = z_1)\) for every \(x_1\), canceling with the numerator.
Although sampling from a distribution known up to a proportionality constant has been extensively studied in the Markov chain Monte Carlo (MCMC) literature (Liu 2008), many MCMC methods add computational burden to an already expensive algorithm that requires \(B + 1\) resamples and computations of the test statistic T. Moreover, it is unclear how "approximate" draws from the desired distribution in an MCMC algorithm may affect the exact validity of the p values. This problem may be exacerbated when the sample size n is large, because the errors in each resample could accumulate exponentially across time. We therefore adopt the NARP along with Assumption 1, as it avoids these complications.
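To make the contrast concrete, under Assumption 1 the NARP simply redraws \(\tilde{X}_t\) sequentially from the known sampling rule, feeding it the resampled \(\tilde{X}\) history and the observed responses. A toy sketch without Z, with an illustrative adaptive rule (the rule itself is a made-up example, not the paper's):

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_rule(x_hist, y_hist, p=3):
    """A known adaptive sampling rule (toy example): start uniform,
    then upweight arms whose running mean response is larger."""
    if len(x_hist) == 0:
        return np.full(p, 1.0 / p)
    w = np.ones(p)
    for j in range(p):
        ys = [y for x, y in zip(x_hist, y_hist) if x == j]
        if ys:
            w[j] += max(np.mean(ys), 0.0)
    return w / w.sum()

def narp_resample(x_obs, y_obs, p=3, rng=rng):
    """Sequentially redraw each X_t from the SAME known rule, conditioning
    on the resampled X history and the OBSERVED responses -- the NARP
    construction, sketched here without Z."""
    x_res = []
    for t in range(len(x_obs)):
        probs = sample_rule(x_res, y_obs[:t], p=p)
        x_res.append(int(rng.choice(p, p=probs)))
    return x_res

# Toy observed data: 5 rounds over 3 arms.
x_obs = [0, 1, 2, 0, 1]
y_obs = [0.5, -0.2, 0.1, 0.8, 0.0]
print(narp_resample(x_obs, y_obs))
```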
Appendix D: Proof of main results presented in Sect. 2
Proof of Theorem 2.3
By definition of our resampling procedure, under \(\text {H}_0\),
where the last “\(\overset{\text {d}}{=}\) ” is by the null hypothesis of conditional independence, namely \(X_1 \perp \!\!\! \perp Y_1 \mid Z_1\). Moreover, it also suggests
Then, we will prove the following statement holds for any \(k \in \{ 1,2,\dots ,n\}\) by induction,
Assuming Eq. D3 holds for \(k-1\), we now prove it also holds for k. For simplicity, in the rest of this proof, we will use \(P(\cdot )\) as a generic notation for pdf or pmf, though the proof holds for more general distributions without a pdf or pmf. First,
where (i) is simply by Bayes rule; (ii) is because \(Z_k \perp \!\!\! \perp \tilde{X}_{1:{k-1}} \mid (Y_{1:(k-1)}, Z_{1:(k-1)})\), since \(\tilde{X}_{1:{k-1}}\) is a random function of only \(Y_{1:(k-1)}\) and \(Z_{1:(k-1)}\); (iii) is by the induction hypothesis; and lastly, (iv) is by Assumption 1. Moreover,
where (i) is again simply by Bayes rule; (ii) is because \(Y_k\) is a random function of only \(Z_k\) (up to time k) under the null \(H_0\) and thus is independent of anything with index smaller or equal to k conditioning on \(Z_k\); (iii) is again by Bayes rule; (iv) is by Definition 2.2; and finally (v) is by the previous equation above. Equation D3 is thus established by induction, as a corollary of which, we also get for any \(k \le n\),
Finally, note that \(\tilde{X} \perp \!\!\! \perp X \mid (Y,Z)\). So, conditioning on (Y, Z), \(\tilde{X}\) and X are exchangeable, which means the p value defined in Equation 5 is conditionally valid, conditioning on (Y, Z). Since \(\mathbb {P}\left( p < \alpha \mid Y, Z \right) \le \alpha \) holds conditionally, it also holds marginally. \(\square \)
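As a sanity check on this exchangeability argument, the following sketch computes the randomization p value \(p = (1 + \#\{b: T^b \ge T\})/(B+1)\), as in Eq. 5, in a toy iid setting under \(H_0\) and confirms the rejection rate stays near or below \(\alpha\); the statistic and sampling rule are illustrative choices, not the paper's exact setup:

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 50, 3

def resample_x(rng):
    """Draw a fresh X vector from the known (here: iid uniform) sampling rule."""
    return rng.integers(0, p, size=n)

def stat(x, y):
    """A 'maximum'-style statistic: largest absolute arm mean."""
    return max(abs(y[x == j].mean()) if (x == j).any() else 0.0 for j in range(p))

def randomization_p_value(x, y, B=99):
    """p = (1 + #{b : T(x_tilde^b, y) >= T(x, y)}) / (B + 1)."""
    t_obs = stat(x, y)
    t_null = np.array([stat(resample_x(rng), y) for _ in range(B)])
    return (1 + np.sum(t_null >= t_obs)) / (B + 1)

reps, rejections = 200, 0
for _ in range(reps):
    x = resample_x(rng)
    y = rng.normal(size=n)                 # H0: Y independent of X
    if randomization_p_value(x, y) <= 0.05:
        rejections += 1
print(rejections / reps)                   # stays near or below alpha = 0.05
```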
Proof of Theorem 2.4
Note that Assumption 1 was only utilized once in the proof of Theorem 2.3, namely (iv) of Eq. D4. So upon assuming \((\tilde{X}_{1:k},Y_{1:k}, Z_{1:k}) \overset{\text {d}}{=} (X_{1:k},Y_{1:k}, Z_{1:k})\), we know immediately from Eq. D4 that
which is exactly Assumption 1. \(\square \)
Appendix E: Proof of results presented in Appendix A
Before proving the main power results, we first state a lemma concerning the effect of taking B to infinity, which justifies assuming B to be large enough and ignoring the discreteness of p values like the one defined in Eq. 5. Similar proof arguments are made in Wu and Ding (2021); thus, we omit the proof of this lemma. The lemma states that as \(B \rightarrow \infty \), conditioning on any given values of \((\textbf{X}, \textbf{Y},\textbf{Z})\),
Lemma E.1
(Power of the ART as \(B \rightarrow \infty \)) For any adaptive sampling procedure A satisfying Definition 2.1 and any test statistic T, as we take \(B \rightarrow \infty \), the asymptotic conditional power of the ART (with the CRT being a degenerate special case) conditional on (Y, Z) is equal to
while the unconditional (marginal) power is equal to
Note that the joint distribution of \((\textbf{X}, \tilde{\textbf{X}}, \textbf{Y}, \textbf{Z})\) is implicitly specified by the sampling procedure A.
Lemma E.2
(Normal-Means Model with iid sampling procedures: Joint Asymptotic Distributions of \(\bar{Y}_j\)’s, \(\tilde{\bar{Y}}_j\)’s and \(\bar{Y}\) Under the Alternative \(\text {H}_1\)) Define
Upon assuming the normal-means model introduced in Sect. 3, under the alternative \(\text {H}_1\) with \(h = h_0 / \sqrt{n}\), as \(n \rightarrow \infty \),
with
where \(G:= (G_1, G_2, \dots , G_{p-1})\) and \(H:= (H_1, H_2, \dots , H_{p-1})\) both follow the same \((p-1)\)-dimensional multivariate Gaussian distribution \(\mathcal {N} \left( 0, \Sigma \right) \) and R is a standard normal random variable. Note that \(\Sigma \) was defined in the statement of Theorem A.1. Moreover, G, H and R are independent.
Remark 1
Roughly speaking, after removing means, R captures the randomness of \(\textbf{Y}\) being sampled from its marginal distribution; H captures the randomness of sampling \(\textbf{X}\) conditioning on \(\textbf{Y}\); lastly, G captures the randomness of resampling \(\tilde{{\textbf {X}}}\) given \(\textbf{Y}\).
Remark 2
We also note that we do not characterize the distribution of \(\tilde{\bar{Y}}_{p}\) or \(\bar{Y}_{p}\), to avoid stating the convergence in terms of a degenerate multivariate Gaussian distribution, since \(\bar{Y}_p\) is a deterministic function of \(\bar{Y}\) and the remaining \(p-1\) arm means.
Proof of Lemma E.2
We first characterize the conditional distribution of \(\tilde{\bar{Y}}_j\). For any \(j \in \{1,2,\dots , p \}\),
By central limit theorem, since \(\text {Var} \left( Y_i (\textbf{1}_{\tilde{X}_i = j} - q_j ) \right) \rightarrow q_j (1 - q_j)\) as \(n \rightarrow \infty \),
which together with Slutsky’s Theorem and the fact that \(q_j n / \sum _{i=1}^n \textbf{1}_{\tilde{X}_i = j} \rightarrow 1\) almost surely gives,
where \(v(q_j) = \text {Var}(\text {Bern}(q_j)) = \text {Var}(\textbf{1}_{\tilde{X}_i = j}) = q_j (1 - q_j)\). In addition to these one-dimensional asymptotic results, we can also derive the joint asymptotic distribution. Before moving forward, we define some useful notation,
and
By multivariate Lindeberg-Feller CLT (see for instance Ash and Doleans-Dade 1999),
which further gives
because of
Therefore, we have
where
with
Roughly speaking, this suggests that after removing the shared randomness induced by \(\frac{\sum _{i=1}^n Y_i}{\sqrt{n}}\), all the \(\sqrt{n} \tilde{\bar{Y}}_j\)’s are asymptotically independent and Gaussian-distributed.
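The limiting variance \(v(q_j) = q_j(1-q_j)\) appearing above can be checked numerically; a quick sketch under the null, where \(Y_i\) is standard normal and independent of the iid arm draws (the arm probabilities below are illustrative):

```python
import numpy as np

rng = np.random.default_rng(2)

q = np.array([0.2, 0.3, 0.5])   # iid sampling weights over p = 3 arms
j, n = 0, 200_000

y = rng.normal(size=n)                        # under H0, Y ~ N(0, 1), independent of X
x = rng.choice(len(q), size=n, p=q)           # iid arm draws with weights q
samples = y * ((x == j).astype(float) - q[j])  # the summand Y_i (1{X_i = j} - q_j)

# Var(Y (1{X=j} - q_j)) = E[Y^2] E[(1{X=j} - q_j)^2] = q_j (1 - q_j) = 0.16 here.
print(samples.var())
```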
Next, we turn to \(\bar{Y}_j\). Note that in this part we view \(X_i\) as generated from \(F_{X \mid Y}\) after \(Y_i\) is generated according to its marginal distribution. The only difference between the observed test statistic and the above is that we have
with \(q^{\star }_i = (q_{i,1}^{\star },q_{i,2}^{\star },\cdots , q_{i,p}^{\star } )\) and
instead. Again, multivariate Lindeberg-Feller CLT gives
with
Note that, since
and
we have
which further gives
Similar to \(\textbf{J}\)’s, we define \(\textbf{J}^{\star }\)’s as well,
and
which together with Eq. E10 gives
Note that although Eq. E7 and Eq. E11 are almost identical, this does not imply that the \(\bar{Y}_j\)'s and \(\tilde{\bar{Y}}_j\)'s have the same asymptotic distribution, since the "mean" parts that have been removed, namely \(\frac{\sum _{i=1}^n Y_i}{\sqrt{n}}\) and \(\frac{\sum _{i=1}^{n} q_{i,j}^{\star } Y_i}{ q_j \sqrt{n}}\), behave differently, as demonstrated in Lemmas E.3, E.4, E.5 and E.6. Roughly speaking, under this \(\sqrt{n}\) scaling, the randomness producing the Gaussian noise in the CLT is the same across them, as shown in Eq. E7 and Eq. E11, but the Gaussian distributions they converge to have different means.
Finally, following exactly the same logic, we can further derive the following joint asymptotic distribution of \(\textbf{J}_{-p,n}\), \(\textbf{J}_{-p,n}^{\star }\) and \(\frac{\sum _{i=1}^n Y_i}{\sqrt{n}}\). Letting
we have
\(\square \)
Lemma E.3
As \(n \rightarrow \infty \),
Proof
By defining \(E_i:= S_i W_i + (1 - S_i) G_i \sim \mathcal {N}(0,1)\), we have
Note that \(E_i\) and \(S_i\) are not independent. Thus,
since by Law of Large Numbers the last two terms will vanish asymptotically and the first term will converge to \(\mathbb {E}(E_i^2) = 1\). Moreover,
where the last line is obtained by applying CLT to the first term and LLN to the second term. \(\square \)
Lemma E.4
As \(n \rightarrow \infty \),
Proof
We first show
Recall that \(Y_i\) can be seen as a mixture of two normal distributions \(\mathcal {N}(0,1)\) and \(\mathcal {N} \left( \frac{h_0}{\sqrt{n}},1 \right) \) with weights \(1-q_1\) and \(q_1\). Thus, \(\mathbb {E} \left( \sqrt{n} q_{i,1}^{\star } Y_i \right) \) is equal to
Note that with a change of variable \(h = h_0/\sqrt{n}\),
Similarly,
Equation E12 is thereby established. Then, we compute \( \lim _{n \rightarrow \infty }\text {Var}(q_{i,1}^{\star }Y_i)\) using the same strategy.
Combining Eq. E12 and Eq. E13, the lemma is thus established by central limit theorem. \(\square \)
Following exactly the same logic, we have the following parallel lemma for \(j \ne 1\).
Lemma E.5
For \(j \ne 1\), as \(n \rightarrow \infty \),
Proof
We first show
Again, recall that \(Y_i\) can be seen as a mixture of two normal distributions \(\mathcal {N}(0,1)\) and \(\mathcal {N} \left( \frac{h_0}{\sqrt{n}},1 \right) \) with weights \(1-q_1\) and \(q_1\). Thus, \(\mathbb {E} \left( \sqrt{n} q_{i,j}^{\star } Y_i \right) \) is equal to
With a change of variable \(h = h_0/\sqrt{n}\), we have
Similarly,
Finally, we have \(\lim _{n \rightarrow \infty }\text {Var}(q_{i,j}^{\star }Y_i) = q_j^2\) as well, which by the CLT finishes the proof. \(\square \)
We can further write down their asymptotic joint distribution. We note that \(q_{i,j}^{\star } = \frac{q_j}{q_2} q_{i,2}^{\star }\) holds deterministically for \(j>2\); thus, it suffices to include only \(j = 1,2\) in the joint asymptotic distribution.
Lemma E.6
As \(n \rightarrow \infty \),
where
and \(\Sigma _3 \in \mathbb {R}^{3\times 3}\) is equal to
In other words, asymptotically these three random variables are completely linearly correlated.
Proof
By Lemma E.4, it suffices to show
which can be established by the following three displays. First,
secondly,
and finally
\(\square \)
Appendix F: Additional simulations in normal-means model
To show that the results presented in Sect. 3.3 are not sensitive to the initially chosen adaptive parameters, and to further optimize over multiple adaptive procedures A as allowed by Algorithm 1, we create Fig. 5, which shows the power of the ART under different combinations of the adaptive parameters, the exploration parameter \(\epsilon \) and the reweighting value \(t_0\), in three different scenarios of p and \(h_0\).
Figure 5 shows that an adaptive procedure with exploration parameter \(\epsilon = 0.7\) is a favorable choice across different signal strengths. Additionally, we find that the optimal reweighting parameter t can differ across scenarios but does not seem to matter much. Our initially chosen parameter \(\epsilon = 0.5\) in Sect. 3.3 was thus not necessarily optimal, which further demonstrates the robustness of the results presented in Sect. 3.3.
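As a rough illustration of how the exploration fraction and the reweighting scale interact, the following toy two-stage allocation in a normal-means setting (uniform exploration for the first \([n\epsilon ]\) rounds, then exponential reweighting by stage-one arm means) shows the signal arm being sampled most. This is a schematic stand-in, not the exact procedure of Definition 3.2:

```python
import numpy as np

rng = np.random.default_rng(3)

def two_stage_counts(n=2000, p=10, h=0.5, epsilon=0.7, t0=2.0, rng=rng):
    """Stage 1: uniform sampling for the first [n * epsilon] rounds.
    Stage 2: sampling weights proportional to exp(t0 * stage-one arm mean)."""
    means = np.array([h] + [0.0] * (p - 1))   # only arm 0 carries signal
    n1 = int(n * epsilon)
    x1 = rng.integers(0, p, size=n1)          # uniform exploration draws
    y1 = rng.normal(means[x1], 1.0)
    arm_means = np.array([y1[x1 == j].mean() if (x1 == j).any() else 0.0
                          for j in range(p)])
    w = np.exp(t0 * arm_means)                # reweighting, scaled by t0
    w /= w.sum()
    x2 = rng.choice(p, size=n - n1, p=w)      # adaptive stage-two draws
    return np.bincount(np.concatenate([x1, x2]), minlength=p)

counts = two_stage_counts()
print(counts)   # the signal arm (index 0) is typically sampled the most
```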
Appendix G: Details of ART in conjoint analysis
In this section, we give further details of the ART used in Sect. 4, along with some additional simulation results.
1.1 G.1 Simulation setup
For our simulation setup, X and Z each contain one factor with four levels, i.e., \(X_t^L, X_t^R, Z_t^L, Z_t^R\) take values in \(\{1, 2, 3, 4\}\). The response model follows a logistic regression model with main effects and an interaction on only one specific combination,
where the first four indicators give main effects \(\beta _X, \beta _Z\) of X and Z, respectively, on the first level of each factor, and the last two indicators give an interaction effect \(\beta _{XZ}\) between the first level of factor X and the second level of factor Z. For example, \(\textbf{1}\{X_t^L = 1, Z_t^L = 2, (X_t^R, Z_t^R) \ne (1, 2) \}\) is one if the left profile values of (X, Z) are (1, 2) but the right profile values of (X, Z) are not (1, 2) simultaneously. In particular, this interaction indicator is still one if \((X_t^L, Z_t^L) = (1,2)\) and \((X_t^R, Z_t^R) = (1, 3)\). For the left plot of Fig. 6, \(\beta _X = \beta _Z = 0.6\) and \(\beta _{XZ} = 0.9\) while we vary the sample size on the x-axis. For the right plot of Fig. 6, the interaction \(\beta _{XZ} = 0\) while we vary \(\beta _X = \beta _Z \in \{0, 0.3, 0.6, 0.9, 1.2\}\) on the x-axis with a fixed sample size of \(n = 1{,}000\). Lastly, our response model assumes "no profile order effect" since all main and interaction effects are repeated symmetrically for the left and right profiles (with the sign flipped because \(Y = 1\) indicates that the left profile is selected).
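A minimal sketch of sampling from this response model, under our reading of the linear predictor described above (the signs and the exact form of the interaction indicator are our reconstruction):

```python
import numpy as np

rng = np.random.default_rng(4)

def sample_conjoint(n=1000, bx=0.6, bz=0.6, bxz=0.9, rng=rng):
    """Draw (X, Z, Y) from a logistic response model with symmetric left/right
    main effects on level 1 and one interaction on (X, Z) = (1, 2)."""
    XL, XR = rng.integers(1, 5, size=n), rng.integers(1, 5, size=n)
    ZL, ZR = rng.integers(1, 5, size=n), rng.integers(1, 5, size=n)
    eta = (bx * ((XL == 1).astype(float) - (XR == 1).astype(float))
           + bz * ((ZL == 1).astype(float) - (ZR == 1).astype(float))
           + bxz * (((XL == 1) & (ZL == 2) & ~((XR == 1) & (ZR == 2))).astype(float)
                    - ((XR == 1) & (ZR == 2) & ~((XL == 1) & (ZL == 2))).astype(float)))
    y = rng.random(n) < 1.0 / (1.0 + np.exp(-eta))   # Y = 1: left profile chosen
    return XL, XR, ZL, ZR, y.astype(int)

XL, XR, ZL, ZR, y = sample_conjoint()
print(y.mean())   # near 0.5 by the left/right symmetry
```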
1.2 G.2 Adaptive procedure and test statistic
We first give a detailed description of our adaptive procedure and then formally define the test statistic used in Sect. 4.
We define \(X_t \sim \text {Multinomial}(p_{t, 1}^X, p_{t, 2}^X, \dots , p_{t, K^2}^X)\), where \(p_{t, j}^X\) represents the probability of sampling the jth arm (an arm refers to each unique combination of left and right factor levels) out of \(K^2\) possible arms, and K is the total number of levels of X. For example, in our simulation setup \(K = 4\), so there are 16 possible arms: (1, 1), (1, 2), etc.; \(p_{t,j}^Z\) is defined similarly. The uniform iid sampling procedure pulls each arm with equal probability, i.e., \(p_{t,j}^X = \frac{1}{K^2}\) and \(p_{t,j}^Z = \frac{1}{L^2}\) for every j, where L is the total number of factor levels of Z. Although we present our adaptive procedure when Z contains only one other factor (typical conjoint analyses have 8–10 other factors), the procedure loses no generality in higher dimensions of Z.
We now propose the following adaptive procedure that adapts the sampling weights of \(p_{t, j}^X, p_{t, j}^Z\) at each time step t in the following way:
where \(\bar{Y}_{j,t}^X\) denotes the sample mean of \(Y_1, Y_2, \dots , Y_{t-1}\) for arm j of variable X, \(\bar{Y}_{j,t}^Z\) is defined similarly, and \(N(0, 0.01^2)\) denotes a Gaussian random variable with mean zero and variance \(0.01^2\) (the two Gaussians in Eq. (G14) are drawn independently). Equation (G14) samples more from arms that look like signals (means further away from 0.5). We add the slight perturbation in case \(\bar{Y}_{j,t}^X\) is exactly 0.5 at some time point t, so that no arm has zero probability of being sampled.
With this reweighting procedure, we build our adaptive procedure. As in Definition 3.2, we have an adaptive parameter \(\epsilon \): the first \([n\epsilon ]\) samples are used for "exploration" via the typical uniform iid sampling procedure. For the remaining samples, we adapt the weights according to Eq. (G14). This adaptive sampling procedure immediately satisfies Assumption 1, and also Assumption 2, since each variable only looks at its own history and previous responses. Algorithm 2 summarizes the adaptive procedure.
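A sketch of the reweighting step in the spirit of Eq. (G14), with the distance-from-0.5 scoring and the Gaussian perturbation described above (the exact functional form is our reading of the text, not its verbatim statement):

```python
import numpy as np

def adapt_weights(arm_means, rng):
    """Weight each arm by how far its running mean response is from 0.5,
    plus a small N(0, 0.01^2) perturbation so no arm's probability is
    driven to exactly zero."""
    arm_means = np.asarray(arm_means, dtype=float)
    scores = np.abs(arm_means - 0.5 + rng.normal(0.0, 0.01, size=arm_means.shape))
    return scores / scores.sum()   # normalize into sampling probabilities

rng = np.random.default_rng(5)
# Running means for 4 hypothetical arms; means far from 0.5 look like signal.
w = adapt_weights([0.5, 0.55, 0.8, 0.2], rng)
print(w)   # the last two arms (means 0.8 and 0.2) receive the most weight
```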
We now define the test statistic under consideration. Although Ham et al. (2022) considered a complex hierarchical Lasso model to capture all second-order interactions, given the simplicity of this simulation setting we consider a cross-validated Lasso logistic test statistic that regresses \(\textbf{Y}\) on the main effects of \(\textbf{X}\) and \(\textbf{Z}\) and their interactions. This leads to the following test statistic:
where \({\hat{\beta }}_k\) denotes the estimated main effect for level k of the K levels of X (one level is held as baseline) and \({\hat{\gamma }}_{kl}\) denotes the estimated interaction effect between level k of X and level l of the L levels of Z.
This test statistic also imposes the "no profile order effect" constraint, i.e., we do not separately estimate coefficients for the left and right profiles, which increases power. When fitting a Lasso logistic regression of \(\textbf{Y}\) on the main effects and interactions of \((\textbf{X}, \textbf{Z})\), we would obtain separate left and right effects. Since "no profile order effect" constrains the left and right effects to be equal, we formally impose the following constraints:
where the superscripts L and R denote the left and right profile effects, respectively. To incorporate this symmetry constraint, we split our original \(\mathbb {R}^{n \times (4 + 1)}\) data matrix \((\textbf{X},\textbf{Z}, \textbf{Y})\) into a new data matrix of dimension \(\mathbb {R}^{2n \times (2 + 1)}\), where the first n rows contain the values for the left profile (with the corresponding Y) and the next n rows contain the values for the right profile with new response \(1-Y\). Ham et al. (2022) show that this formally imposes the constraints in Eq. (16) by destroying any profile order information in the new data matrix.
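The matrix-splitting construction can be sketched directly; the helper name below is ours:

```python
import numpy as np

def impose_no_profile_order(XL, XR, ZL, ZR, y):
    """Stack left and right profiles so a single set of coefficients is fit:
    the first n rows are (X^L, Z^L) with response Y, the next n rows are
    (X^R, Z^R) with response 1 - Y, destroying profile-order information."""
    X = np.concatenate([XL, XR])
    Z = np.concatenate([ZL, ZR])
    resp = np.concatenate([y, 1 - y])
    return np.column_stack([X, Z, resp])   # shape (2n, 3), i.e. R^{2n x (2+1)}

# Tiny example with n = 2 respondents.
XL = np.array([1, 2]); XR = np.array([3, 4])
ZL = np.array([2, 1]); ZR = np.array([4, 3])
y = np.array([1, 0])
M = impose_no_profile_order(XL, XR, ZL, ZR, y)
print(M.shape)   # (4, 3)
```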
1.3 G.3 Simulation results
We first compare the power of our adaptive procedure stated in Algorithm 2 with the iid setting, where each arm for X and Z is drawn uniformly at random, under the simulation setting described in Appendix G.1. We empirically compute the power as the proportion of 1000 Monte Carlo p values less than \(\alpha = 0.05\).
For the left panel of Fig. 6, we increase the sample size when there exist both main effects and an interaction effect of X. More specifically, we vary the sample size \(n \in \{450, 600, 750, 1{,}000, 1{,}300\}\) while fixing the main effects of X and Z at 0.6 and a stronger interaction effect at 0.9 (these refer to the coefficients of the logistic response model defined in Appendix G.1). For the right panel of Fig. 6, we increase the main effects of X and Z with no interaction effect and a fixed sample size of \(n = 1{,}000\). We also vary the exploration parameter \(\epsilon \) in Algorithm 2 over \(\epsilon = 0.25, 0.5, 0.75\).
Both panels of Fig. 6 show that the power of the ART with the proposed adaptive sampling procedure is uniformly greater than that of the CRT with the typical uniform iid sampling procedure (green). For example, when \(n = 1{,}000\) in the left panel, there is a difference of 8.5 percentage points (59% versus 67.5%) between the iid sampling procedure and the adaptive sampling procedure with \(\epsilon = 0.5\) (red). When the main effect is as strong as 1.2 in the right panel, the difference is 24 percentage points (57% versus 81%). Additionally, when the main effect is 0 in the right panel, and thus \(H_0\) holds, all methods control the type 1 error as expected, with power near \(\alpha = 0.05\) (dotted black horizontal line).
Cite this article
Ham, D.W., Qiu, J. Hypothesis testing in adaptively sampled data: ART to maximize power beyond iid sampling. TEST 32, 998–1037 (2023). https://doi.org/10.1007/s11749-023-00861-2
Keywords
- Conditional independence testing
- Randomization inference
- Adaptive sampling
- Model-X
- Nonparametric testing