A new inferential approach for response-adaptive clinical trials: the variance-stabilized bootstrap

This paper discusses the disadvantages and limitations of the inferential approaches available for sequential clinical trials in which treatment comparisons are managed via response-adaptive randomization. We then propose an inferential methodology for response-adaptive designs which, by embedding a variance-stabilizing transformation within a bootstrap framework, overcomes the above-mentioned drawbacks, regardless of the chosen allocation procedure and the desired target. We derive the theoretical properties of the suggested proposal, showing its superiority with respect to likelihood-based, randomization-based and design-based inferential approaches. Several illustrative examples and simulation studies are provided to confirm the relevance of our results.

In equally randomized clinical trials, half the patients will be randomized to a potentially ineffective or harmful treatment. This dilemma becomes more acute in the context of grave or emerging novel infectious diseases. Motivated by the ethical demand of individual care, response-adaptive (RA) designs have received increasing attention in the literature over the last two decades.
By using the information provided by earlier responses and past assignments, RA procedures are sequential rules in which the treatment allocation probabilities change in order to favour at each step the treatment that appears to be superior and, asymptotically, to reach a desired treatment allocation proportion, the so-called target, representing a valid trade-off between ethics and inference (see, e.g., Rosenberger et al. (2001) and Baldi Antognini and Giovagnoli (2010)). Indeed, since the ethical goal of maximizing the subjects' care often conflicts with the statistical one of drawing correct inferential conclusions about the identification of the better treatment and its relative superiority, the targets generally depend on the unknown treatment effects: although a priori unknown, they can be approached by RA procedures that sequentially estimate the parameters so as to progressively converge to the chosen target [see for a review Atkinson and Biswas (2014), Baldi Antognini and Giovagnoli (2015) and Rosenberger and Lachin (2015)]. Some examples are the Sequential Maximum Likelihood design (Melfi and Page 2000), the Doubly-adaptive Biased Coin design (Eisele 1994) and the Efficient Randomized Adaptive DEsign (ERADE) introduced by Hu et al. (2009) in order to improve the convergence to the chosen target.
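To fix ideas, the sequential plug-in mechanism shared by these designs can be sketched in a few lines of code. The following minimal simulation is illustrative only (the names smle_trial and rho_pw are our own, and the rule shown is a simplified Sequential Maximum Likelihood-type allocation for binary responses, not the exact cited procedures):

```python
import random

def smle_trial(n, p_a, p_b, target, seed=0):
    """Sketch of a Sequential Maximum Likelihood-type design: each patient is
    assigned to treatment A with probability target(theta_a_hat, theta_b_hat),
    where the estimates are the current sample means (binary outcomes)."""
    rng = random.Random(seed)
    succ = {"A": 0, "B": 0}
    count = {"A": 0, "B": 0}
    for _ in range(n):
        # Plug-in estimates (use 1/2 until both arms have been observed)
        if count["A"] == 0 or count["B"] == 0:
            prob_a = 0.5
        else:
            prob_a = target(succ["A"] / count["A"], succ["B"] / count["B"])
        arm = "A" if rng.random() < prob_a else "B"
        count[arm] += 1
        succ[arm] += rng.random() < (p_a if arm == "A" else p_b)
    return count["A"] / n  # observed allocation proportion pi_n

# Hypothetical target: limiting allocation of the play-the-winner rule
rho_pw = lambda ta, tb: (1 - tb) / ((1 - ta) + (1 - tb))
```

Under condition (1), the returned allocation proportion converges to the chosen target as n grows.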
Although the adaptation process induces a complex dependence structure between the outcomes, several authors have provided conditions under which classical asymptotic likelihood-based inference is still valid for RA procedures [see, e.g., Durham et al. (1997) and Melfi and Page (2000)]. In particular, let us assume that the observations on either treatment, say A and B, are i.i.d. and belong to the exponential family, parameterized in such a way that θ j ∈ Θ ⊆ R denotes the mean effect of treatment j, while v j = v(θ j ) > 0 is the corresponding variance (j = A, B). Special cases of practical relevance in the clinical context for modeling the primary endpoint, which in what follows will be referred to as statistical models, are the binary (with θ j ∈ (0; 1), v(θ j ) = θ j (1 − θ j )) and Poisson (θ j ∈ R +, v(θ j ) = θ j ) distributions for dichotomous and count data, respectively; the normal model (with θ j ∈ R and v(θ j ) = v j independent of θ j ) is also encompassed for continuous responses, as well as the exponential one (θ j ∈ R +, v(θ j ) = θ j ²) for survival outcomes. The inferential goal usually consists in estimating/testing the superiority of A wrt B and, therefore, interest lies in the treatment contrast ϑ = θ A − θ B , while θ B is usually regarded as a nuisance parameter; from now on we therefore adopt the model re-parameterization (θ A , θ B ) → (ϑ, θ B ). Let π n be the allocation proportion to A (respectively, 1 − π n to B) after n steps; if the RA design is chosen such that lim n→∞ π n = ρ(ϑ, θ B ) ∈ (0; 1) a.s., with ρ(·) continuous, (1) then the applicability of standard asymptotic inference is ensured. Generally satisfied by the RA rules proposed in the literature, this crucial condition prescribes that the target ρ must be a non-random quantity different from 0 and 1, so as to avoid possible degeneracy of likelihood methods.
Moreover, assuming (without loss of generality) that high responses are preferable for patients' care, an additional common assumption is: ρ is monotonically increasing in ϑ, with ρ(0, θ B ) = 1/2. (2)
For example, adopting the Play-the-Winner rule proposed by Zelen (1969) for binary trials, the allocation proportion of A converges to ρ PW (ϑ, θ B ) = (1 − θ B )/(2 − 2θ B − ϑ). Other targets proposed in the literature depend only on the treatment difference: for instance, in the case of normal homoscedastic trials, Bandyopadhyay and Biswas (2001) and Atkinson and Biswas (2005) suggested the target ρ N (ϑ) = Φ(ϑ/T ), where Φ(·) is the cumulative distribution function of the standard normal and T > 0 a tuning parameter.
For moderate-to-large samples, namely the most representative framework in phase-III clinical trials, several authors have shown (both theoretically and via simulations) that the likelihood-based approach can present anomalies in terms of coverage probabilities of confidence intervals, as well as inflated type-I errors or inconsistency of Wald's test, especially when the chosen targets exhibit a strong ethical component (Rosenberger and Hu 1999; Yi and Wang 2011; Atkinson and Biswas 2014; Baldi Antognini et al. 2018a; Novelli and Zagoraiou 2019). To avoid these drawbacks, Wei (1988) and Rosenberger (1993) suggested conducting randomization-based inference for RA trials. Under this framework, the null hypothesis of equality of the two arms corresponds to an allocation in which the treatment assignments are unrelated to the responses, so the randomization test is carried out by computing the distribution of the allocations conditionally on the observed outcomes (which are treated as deterministic). Since the distribution of the test depends on the adopted RA procedure, few exact results are available and, generally, p-values are computed via Monte Carlo methods. Following a design-based approach, Baldi Antognini et al. (2018b) recently introduced a test based on the treatment allocation proportion induced by a suitably chosen RA rule, showing that, in some circumstances, this test can be uniformly more powerful than the Wald test.
After discussing the drawbacks and limitations of the available inferential approaches, the aim of this paper is to provide a new inferential methodology for RA clinical trials by combining a variance stabilizing transformation with a bootstrap method. We derive the theoretical properties of the suggested proposal, showing that it is more accurate than the other approaches, regardless of the adopted RA rule as well as the chosen target. Several illustrative examples are provided for normal, binary, Poisson and exponential data. Starting from a discussion in Sect. 2 about the existing approaches, highlighting their inadequacy for RA clinical trials, Sect. 3 deals with the new variance-stabilized bootstrap procedure and its theoretical properties. An extensive simulation study is carried out in Sect. 4 to confirm the relevance of our results, also comparing the performance of the newly introduced approach with that of other inferential methods. Finally, Sect. 5 deals with some concluding remarks.

Likelihood-based inference
Although for RA designs the MLEs θ̂ n = (θ̂ An , θ̂ Bn )ᵗ of θ = (θ A , θ B )ᵗ remain the same as in the non-sequential setting (i.e., the sample means), this is not true for their distribution, due to the complex dependence structure generated by the adaptation process. However, standard asymptotic inference is allowed for RA designs satisfying (1). Indeed, let σ ρ ² = v A /ρ(ϑ, θ B ) + v B /(1 − ρ(ϑ, θ B )) be the asymptotic variance of √n ϑ̂ n , where ϑ̂ n = θ̂ An − θ̂ Bn ; due to the continuity of the target, √n (ϑ̂ n − ϑ) →d N(0, σ ρ ²). So, letting the v̂ jn 's be consistent estimators of the treatment variances, the Wald statistic W n = √n ϑ̂ n /σ̂ n , with σ̂ n ² = v̂ An /π n + v̂ Bn /(1 − π n ), is usually employed. Under H 0 , W n converges to the standard normal distribution and, due to the consistency of σ̂ n ², the power of the α-level test against H 1 : ϑ > 0 can be approximated by Φ(√n ϑ/σ ρ − z 1−α ), where z 1−α is the (1 − α)-quantile of the standard normal. (3) Even if condition (1) theoretically guarantees the applicability of likelihood inference, this approach may present critical drawbacks, in particular for targets characterized by a high ethical component. Indeed, as shown in Baldi Antognini et al. (2018a) and Novelli and Zagoraiou (2019), if ρ tends either to 0 or to 1, the asymptotic variance of ϑ̂ n diverges. Therefore, the quality of the CLT approximation is compromised, leading to unreliable confidence intervals and inflated type-I errors. Furthermore, some targets (like, e.g., ρ N and ρ L ) can induce a substantial loss of inferential precision, since the Wald test becomes inconsistent and displays a non-monotonic power.
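For concreteness, the Wald statistic discussed above can be computed directly from the trial summaries; the following sketch (the function name and interface are our own) assumes the one-sided test at the 5% level:

```python
import math

def wald_test(mean_a, mean_b, var_a, var_b, pi_n, n):
    """Wald test sketch: W_n = sqrt(n) * (mean_a - mean_b) / sigma_hat,
    with sigma_hat^2 = v_A_hat / pi_n + v_B_hat / (1 - pi_n)."""
    sigma2 = var_a / pi_n + var_b / (1.0 - pi_n)
    w_n = math.sqrt(n) * (mean_a - mean_b) / math.sqrt(sigma2)
    return w_n, w_n > 1.6449  # z_{0.95}, for H1: theta_A - theta_B > 0
```

For instance, with balanced allocation (pi_n = 0.5), unit variances and n = 100, a mean difference of 0.2 gives W_n = 1.0, below the one-sided critical value.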

Randomization-based inference
Randomization tests, also known as re-randomization tests, are a class of nonparametric procedures obtained by recomputing a test statistic D n (such as ϑ̂ n or other discrepancy measures between the two arms, like those based on ranks) over permutations of the data (Rosenberger and Lachin 2015). Taking into account the null hypothesis (under which the allocations are unrelated to the patients' outcomes), the procedure is carried out by considering the set of responses as fixed, deterministic values and computing all the possible ways in which the subjects could have been assigned to the treatments. However, since the computation of all the treatment assignment permutations and their probabilities is not feasible even for small or moderate sample sizes, in practice randomization tests are computed using Monte Carlo methods. In particular, the allocation sequence is generated L times and, for each sequence, the statistic of interest is computed, obtaining {d n (l) , l = 1, . . . , L}. Then, the p-value can be approximated by the proportion of generated sequences yielding a value of the test statistic equal to or more extreme than the value d n observed on the actual data, namely P̂ rand = L⁻¹ Σ l=1,...,L I(|d n (l) | ≥ |d n |), where I(·) is the indicator function; the test of level α rejects H 0 if P̂ rand is lower than the significance level. Analogously, the power of the randomization test can be approximated via Monte Carlo methods by repeating the above-mentioned procedure H times and computing the proportion of rejections (Beran 1986).
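A minimal Monte Carlo version of this procedure might look as follows. Note that, as a simplification, the sketch re-generates assignment sequences by permuting the observed allocations rather than re-running the RA rule on the fixed responses, which is what a genuine RA design would require; the function name is illustrative:

```python
import random

def randomization_pvalue(outcomes, assignments, L=2000, seed=0):
    """Monte Carlo randomization test (sketch): responses are held fixed and
    L alternative assignment sequences are generated; here they are obtained
    as random permutations of the observed allocation vector."""
    rng = random.Random(seed)

    def contrast(assign):
        # Difference of arm means, the discrepancy measure d_n
        a = [y for y, t in zip(outcomes, assign) if t == "A"]
        b = [y for y, t in zip(outcomes, assign) if t == "B"]
        return sum(a) / len(a) - sum(b) / len(b)

    d_obs = contrast(assignments)
    perm = list(assignments)
    extreme = 0
    for _ in range(L):
        rng.shuffle(perm)  # new assignment sequence, outcomes kept fixed
        extreme += abs(contrast(perm)) >= abs(d_obs)
    return extreme / L     # two-sided Monte Carlo p-value
```

With strongly separated outcomes, the permuted contrasts rarely match the observed one and the estimated p-value is correspondingly small.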
One of the main strengths of randomization tests consists in avoiding any parametric assumption on the population model; this makes them a valid alternative to standard likelihood methods, especially when the conventional model assumptions may not hold or cannot be verified (Rosenberger et al. 2019). However, the behavior of randomization tests strictly depends on the particular RA procedure that has been adopted, and their applicability may be severely limited by the quite restrictive specification of the null hypothesis being tested. For instance, if the chosen RA design depends only on the treatment effects, then the null hypothesis of the randomization test actually corresponds to testing the equality of the effects, with an alternative that is naturally two-sided (i.e., the allocations depend on the treatment outcomes). Although these procedures have also been applied for the one-sided alternative H 1 : ϑ > 0, they are not suitable for a general hypothesis testing problem. For instance, assuming ρ PW for binary trials or ρ N for normal outcomes as discussed above, a commonly used alternative H 1 : ϑ > δ, for a prespecified minimum significant difference δ, cannot be tested via re-randomization. Moreover, such an approach does not directly allow the construction of confidence intervals.

Design-based inference
Taking into account targets depending only on the treatment difference, namely ρ = ρ(ϑ), satisfying (1)-(2) with ρ(ϑ) = 1 − ρ(−ϑ) so as to treat the two arms symmetrically, Baldi Antognini et al. (2018b) have recently introduced a design-based strategy for normal response trials that overcomes some drawbacks of the Wald test. In particular, under condition (1), both ρ(ϑ̂ n ) and the treatment allocation proportion π n are consistent estimators of ρ(ϑ). Thus, if we further assume that ρ is twice continuously differentiable with bounded derivatives, then, adopting ERADE [or an asymptotically best RA procedure as defined by Zhang and Rosenberger (2006)], √n (π n − ρ(ϑ)) →d N(0, λ²), with λ² = [ρ′(ϑ)]² {v A /ρ(ϑ) + v B /(1 − ρ(ϑ))}, (4) where the term in braces is the so-called Cramér-Rao lower bound and ρ′ is the derivative of ρ (the asymptotic normality follows from the Delta-method, provided that ρ′(ϑ) ≠ 0). Thus, letting λ̂ n ² = [ρ′(ϑ̂ n )]² {v̂ An /π n + v̂ Bn /(1 − π n )} be a consistent estimator of λ², then C I (ρ(ϑ)) 1−α = (π n ± z 1−α/2 λ̂ n /√n) and, due to the monotonicity of ρ, the asymptotic confidence interval for ϑ can be derived by applying the inverse mapping ρ⁻¹ to the endpoints of C I (ρ(ϑ)) 1−α . Analogously, testing the equality of the treatment effects is equivalent to testing H 0 : ρ(ϑ) = 1/2 (against H 1 : ρ(ϑ) > 1/2 or H 1 : ρ(ϑ) ≠ 1/2, corresponding to H 1 : ϑ > 0 or H 1 : ϑ ≠ 0, respectively). Under H 0 , the test statistic Z n = √n (π n − 1/2) λ̂ n ⁻¹ converges to the standard normal distribution, while under H 1 : ρ(ϑ) > 1/2 the power of the α-level test Z n can be approximated by Φ(√n (ρ(ϑ) − 1/2)/λ − z 1−α ). (5) Moreover, under some additional conditions on ρ, power (5) is monotonically increasing in ϑ and Z n tends to be more powerful than the Wald test. However, the major drawback of this approach is its strong dependence on the chosen target, which can significantly affect λ² through its ethical skew, leading to possibly inflated type-I errors.
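As an illustration, the design-based test Z n reduces to a few arithmetic operations once the derivative of the target is available; in the sketch below the function name, the derivative rho_prime and the plugged-in values are all hypothetical:

```python
import math

def design_based_test(pi_n, diff_hat, var_a, var_b, n, rho_prime):
    """Design-based test sketch: Z_n = sqrt(n) (pi_n - 1/2) / lambda_hat,
    with lambda_hat^2 = [rho'(diff_hat)]^2 (v_A/pi_n + v_B/(1 - pi_n))."""
    lam2 = rho_prime(diff_hat) ** 2 * (var_a / pi_n + var_b / (1.0 - pi_n))
    z_n = math.sqrt(n) * (pi_n - 0.5) / math.sqrt(lam2)
    return z_n, z_n > 1.6449  # one-sided test at the 5% level
```

The statistic uses only the observed allocation proportion and plug-in estimates, which is precisely why a steep target derivative near 0 destabilizes lambda_hat under H 0.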
Indeed, by combining (1), (2), (4) and the symmetric structure of the target, under H 0 we obtain λ² = 2 [ρ′(0)]² (v A + v B ), which clearly limits the choice of the target as well as the values of the tuning parameter T , if present (λ² is strongly affected by ρ′(0), which represents the ethical improvement of the chosen target, especially when ρ′(0) tends to grow quickly).
These are the main reasons why the design-based test can present inflated type-I errors for several targets and some values of T . For instance, taking into account normal response trials, although ρ N and ρ L are twice differentiable with ρ N ″(0) = ρ L ″(0) = 0, these targets tend to be highly sensitive to small variations of the treatment difference ϑ around 0 (i.e., under H 0 ), especially for small values of T ; the target ρ S , instead, is not twice differentiable at 0; moreover, ρ S ′(0) vanishes as T grows and becomes unbounded as T → 0, thus damaging the CLT approximation (as we will point out in Table 1).
Example 1 Figure 1 shows the simulated distributions of the allocation proportion π n under H 0 : ϑ = 0, adopting ρ S and ρ L with T ∈ {0.5, 1, 2}, obtained by simulating 100000 homoscedastic normally distributed trials with n = 250 using ERADE (with randomization parameter γ = 0.5). Adopting ρ L , for T = 0.5 the resulting distribution tends to be concentrated on the extremes, presenting peaks at 0 and 1, while for T ≥ 1 the asymptotic normality is preserved. Under ρ S instead, small values of T tend both to increase the variability of the distribution of π n and to accentuate the departure from normality; this effect is greatly mitigated for T > 1.
Test Z n could be naturally extended to a target ρ(ϑ, θ B ) depending on the nuisance parameter θ B by letting λ 2 = ∇ρ t M −1 ∇ρ and to other models belonging to the exponential family, as we will discuss in Sect. 4 for binary, Poisson and exponential outcomes.

The variance-stabilized bootstrap-t approach
In order to avoid the aforementioned drawbacks of both likelihood-based and designbased inference, also overcoming the limitations of randomization-based tests, we now propose a new inferential approach for RA procedures developed through a variance-stabilized bootstrap-t method (Tibshirani 1988;Efron and Tibshirani 1994). By mapping the statistic of interest via a variance stabilizing transformation and computing its bootstrap-t distribution, this proposal allows us to avoid the problems related to the instability of the asymptotic variance as well as the quality of the CLT approximation.
The main idea behind variance stabilization is the following: let X be a random variable with expected value μ and variance ν = ν(μ); letting g(·) be a regular transformation such that g′(μ) = ν(μ)⁻¹ᐟ², the variance of g(X) is first-order constant, namely it is approximately independent of μ in a first-order Taylor expansion. Therefore, given a chosen target ρ, by applying such a variance stabilizing transformation to the estimated treatment difference ϑ̂ n , we are able to overcome the possible degeneracy of its asymptotic variance σ ρ ². In particular, for every fixed θ B ∈ Θ (and v ∈ R + for normal homoscedastic outcomes), letting σ ρ ²(x) denote the asymptotic variance of √n ϑ̂ n as a function of ϑ = x, the variance-stabilized test statistic is T n = √n {g(ϑ̂ n ) − g(0)}, where g(x) = ∫ˣ σ ρ (s)⁻¹ ds; (7) under H 0 : ϑ = 0, T n converges to the standard normal distribution, so the α-level right-sided test consists in rejecting the null hypothesis H 0 when T n > z 1−α . Notice that the transformation g(·) depends on the chosen target as well as on the statistical model through the variance function, and thus it may also depend on θ B and v; therefore, the estimation of the nuisance parameters is required for computing the test statistic, and from now on we let v̂ n be a consistent estimator of v. The following Corollary presents the transformation g(·) and the corresponding test T n for the most common statistical models and for some selected targets. In particular, a widely used one is the target in (8), which corresponds to the Neyman allocation for exponential outcomes and to the E-optimal design for Poisson responses (also considered by Zhang and Rosenberger (2006) for normal trials with non-negative means and by Baldi Antognini and Giovagnoli (2010) for binary outcomes).
Notice that for some targets, e.g., ρ PW , the transformation function g(·) is not available in closed form and it should be evaluated numerically using standard integration routines (like, e.g., integrate in R).
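In Python, for instance, the numerical evaluation of g(x) = ∫ˣ ν(s)⁻¹ᐟ² ds can be sketched with a simple trapezoidal rule (the function name and lower integration limit are illustrative; in practice one would rely on a quadrature routine such as R's integrate):

```python
def variance_stabilizer(var_fn, x, lower=0.0, steps=2000):
    """Numerically evaluate g(x) = integral from `lower` to x of
    var_fn(s)**(-1/2) ds by the trapezoidal rule; var_fn is the asymptotic
    variance function of the statistic being stabilized."""
    h = (x - lower) / steps
    total = 0.5 * (var_fn(lower) ** -0.5 + var_fn(x) ** -0.5)
    total += sum(var_fn(lower + i * h) ** -0.5 for i in range(1, steps))
    return total * h
```

As a sanity check against closed forms: for constant variance v(s) = 1 the stabilizer is the identity, while for v(s) = s (Poisson-type variance) g(x) = 2(√x − √lower).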
Assuming that the outcomes belong to the exponential family discussed in Sect. 1, the following results hold.

Theorem 1 The variance-stabilized test T n is consistent, and its power function is monotonically increasing in ϑ, regardless of the chosen target.
Proof By definition, the variance stabilizing transformation g(·) is a continuous and monotonically increasing function and, therefore, the power of T n is increasing as well. Furthermore, by noticing that, for every ϑ > 0, g(ϑ̂ n ) → g(ϑ) > g(0) a.s., test T n is always consistent.

Theorem 2 If the target ρ is chosen such that
the variance-stabilized test T n is uniformly more powerful than Wald's test. Furthermore, condition (10) holds if σ 2 ρ (ϑ) is increasing for ϑ > 0.
Proof Condition (10) can be easily derived from the power function of test T n combined with (3). Moreover, the last statement follows from the mean value theorem; indeed, due to the continuity of σ ρ (·), for every x > 0 there exists c ∈ [0; x] such that g(x) − g(0) = x/σ ρ (c) ≥ x/σ ρ (x). As an example, let us consider ρ L for normal homoscedastic data. Condition (10) then simplifies to an inequality that can be verified by noticing that its left-hand side is an increasing function for every x > 0 and equals 0 at x = 0. As we will show in the following Corollary, for normal homoscedastic outcomes test T n is uniformly more powerful than Wald's test, regardless of the chosen target (see also Table 1). In general, however, the superiority of T n depends on the adopted target and the given statistical model.

Corollary 2
For normal homoscedastic outcomes, test T n is uniformly more powerful than W n regardless of the chosen target. Adopting ρ R for exponential data, as well as under ρ Z for Poisson trials, test T n is uniformly more powerful than W n .
In order to overcome possible problems related to the quality of the CLT approximation, we apply such a variance stabilizing transformation within a bootstrap framework. Since standard re-sampling techniques (like the nonparametric bootstrap) may not be suitable for non-exchangeable/dependent data, we suggest a parametric bootstrap that makes use of the estimated parameters and generates replicates of both the allocation sequence induced by the chosen RA rule and the corresponding outcomes, without re-sampling the observed data. Following the same arguments as Rosenberger and Hu (1999), who derived bootstrap confidence intervals for adaptive designs, if the RA procedure satisfies condition (1), then the bootstrap method is still first-order consistent. Indeed, in this case the MLEs are consistent and asymptotically normal, so the first-order consistency of the bootstrap estimators follows directly [see the Appendix of Rosenberger and Hu (1999)]. Moreover, such a variance-stabilized bootstrap-t method has been proven to be transformation-respecting, second-order correct and accurate, providing good performance in fairly general settings (DiCiccio and Efron 1996; Hall 2013).
More specifically, given an RA design fulfilling conditions (1)-(2), the proposed strategy is the following:
1. at the end of the trial with n subjects, derive θ̂ n ;
2. generate B 1 replicates of the RA trial of size n using θ̂ n as the underlying parameters, obtaining θ̂ * i n and then ϑ̂ * i n , for i = 1, . . . , B 1 ;
3. for each i, generate B 2 replications of the trial using θ̂ * i n as the underlying parameters and compute the bootstrap estimate ν̂ * i n of the variance of √n ϑ̂ * i n over the B 2 replicates (i = 1, . . . , B 1 );
4. fit a curve to the points (ϑ̂ * i n , ν̂ * i n ) using a nonlinear regression technique, such as the lowess running smoother (Cleveland 1979), to estimate ν(·) and compute the variance stabilizing transformation g(x) = ∫ˣ ν̂(s)⁻¹ᐟ² ds by using a numerical integration technique;
5. generate B 3 new replicates of the trial using θ̂ n to obtain ϑ̂ * j n (j = 1, . . . , B 3 ) and then compute the (1 − α)-percentile t * 1−α of the studentized distribution {√n (g(ϑ̂ * j n ) − g(ϑ̂ n )), j = 1, . . . , B 3 }.
Let T * n be the bootstrap version of (7): given H 1 : ϑ > 0, the α-level test rejects H 0 when T * n > t * 1−α (the two-tailed alternative can be derived accordingly). Then, denoting by t * j n the test statistic calculated on the jth bootstrap replicate (j = 1, . . . , B 3 ), the p-value can be approximated by P̂ boot = B 3 ⁻¹ Σ j=1,...,B 3 I(t * j n ≥ t * n ), where t * n is the value of T * n evaluated on the observed data. Finally, the power of test T * n can be approximated via Monte Carlo methods by repeating steps 1-5 H times and computing the percentage of rejections (Beran 1986). As regards the construction of confidence intervals, by applying the inverse mapping g⁻¹ to the endpoints of the bootstrap-t interval for g(ϑ), we obtain C I (ϑ) 1−α = ( g⁻¹(g(ϑ̂ n ) − t * 1−α/2 /√n) ; g⁻¹(g(ϑ̂ n ) − t * α/2 /√n) ), where t * α/2 and t * 1−α/2 are the corresponding percentiles of the studentized distribution in step 5.
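A compact sketch of step 5 and of the resulting rejection rule is given below; the trial simulator simulate_trial (which must re-run the chosen RA design on given parameters and return the estimated treatment difference) and the already-estimated transformation g are supplied by the user, and all names are illustrative:

```python
import random

def vs_bootstrap_test(theta_hat, simulate_trial, g, n,
                      B3=1000, alpha=0.05, seed=0):
    """Variance-stabilized bootstrap-t test (sketch of step 5):
    `simulate_trial(theta, n, rng)` replicates the RA trial and returns the
    estimated treatment difference; `g` is the stabilizing transform."""
    rng = random.Random(seed)
    diff_hat = theta_hat[0] - theta_hat[1]
    # Studentized bootstrap distribution of sqrt(n){g(diff*) - g(diff_hat)}
    stud = sorted(n ** 0.5 * (g(simulate_trial(theta_hat, n, rng)) - g(diff_hat))
                  for _ in range(B3))
    t_crit = stud[int((1 - alpha) * B3) - 1]   # (1 - alpha)-percentile
    t_obs = n ** 0.5 * (g(diff_hat) - g(0.0))  # observed test statistic
    return t_obs, t_crit, t_obs > t_crit
```

In practice, steps 2-4 would first be run to estimate g from the (ϑ̂ * i n , ν̂ * i n ) pairs; here g is passed in directly to keep the sketch short.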

Remark 1
The use of different sets of bootstrap replicates for the estimation of (i) the variance stabilizing transformation g(·) (steps 2-3) and (ii) the percentile t * 1−α (step 5) is intended to limit the computational burden, reducing the calculations considerably wrt the usual untransformed bootstrap-t method. Indeed, as shown by Tibshirani (1988), B 1 = 100 and B 2 = 25 are sufficient to reliably estimate g(·), while at least B 3 = 1000 replicates are needed to derive t * 1−α . It is worth stressing that the implementation of our proposal is not time consuming: on a regular laptop, it takes about 1 second to perform a hypothesis test or to build a confidence interval.

A comparative simulation study
In this section, we compare the performance of the newly introduced test T * n with that of Wald's statistic W n , the design-based test Z n and the randomization test D n (using ϑ̂ n as discrepancy measure). To do so, we performed a simulation study employing ERADE (γ = 0.5) with n = 250 and a starting sample of n 0 = 2 subjects per treatment. In the first scenario, the responses are assumed to be homoscedastic normally distributed with unknown common variance v = 1. Table 1 summarizes the results adopting targets ρ L and ρ S (with T = 0.5, 1 and 2), obtained with 100000 Monte Carlo replications of the trial for W n , Z n and D n , while we set B 1 = 300, B 2 = 100 and B 3 = 10000 for T * n . Because of its strong ethical skew, target ρ L induces an anomalous behavior of the power of W n , which tends to the significance level as ϑ grows (especially as the ethical skew increases, namely for T ≤ 1, when the power function rapidly vanishes); note that all the remaining tests are consistent. Adopting ρ S instead, the consistency of the Wald test is preserved, while Z n exhibits inflated type-I errors. In general, the new test T * n preserves the nominal type-I error and provides an improvement in inferential precision wrt all the competitors. This is particularly true with ρ S : indeed, for T = 0.5 the gain in power of T * n wrt W n and D n is about 4% and 7%, respectively. The second scenario deals with binary trials: Table 2 describes the performance of the four tests adopting ρ PW and ρ R as θ B varies. While preserving the nominal type-I error, T * n shows the highest power in all the scenarios, with an improvement of about 8% wrt D n and up to 4%-5% wrt Z n and W n , respectively. Test Z n shows a slight inflation of the type-I error for ρ PW and θ B = 0.7. It is worth stressing that T * n , Z n and D n confirm their consistency under all the adopted targets, while this is not true for Wald's test under ρ PW .
Table 3 describes the simulation results obtained with exponential and Poisson data adopting ρ R and ρ Z , respectively. Under these scenarios, T * n confirms the good results in terms of power, with a gain up to 4% wrt D n and up to 2−3% wrt Z n and W n , respectively. Tests Z n and W n tend to perform quite similarly, while the randomization test D n exhibits the lowest inferential precision.
Taking now into account CIs, Table 4 compares the simulated C I (ϑ) 0.95 obtained in the case of normal homoscedastic trials (with v = 1) adopting ρ L and ρ S with ERADE (γ = 0.5) and n = 250, as ϑ and T vary. Here, the Lower (L) and Upper (U) bounds are obtained by averaging the endpoints over the simulated trials. Under ρ L , for T = 2, all the considered approaches perform quite similarly, with an empirical coverage that increases as the empirical evidence increases. Although for ϑ ≤ 1.5 the endpoints obtained through the bootstrap procedure are close to the asymptotic likelihood-based ones, as ϑ grows the likelihood-based CIs tend to degenerate, while the bootstrap ones maintain their reliability with only a slight increase in width. Note that, due to the inverse mapping, the applicability of the design-based CIs is severely limited: when the chosen target approaches 1 (i.e., for small values of T or when ϑ grows), the CIs for ρ often contain values outside (0; 1) and therefore the inverse mapping cannot be properly applied (for this reason, we use the symbol − in Tables 4 and 5). This is particularly evident for T < 1 or ϑ > 1.5. Adopting ρ S instead, the design-based CIs do not diverge but strongly undercover when ϑ = 0. Likelihood-based and bootstrap-based CIs perform fairly well, with the latter displaying slightly asymmetric right endpoints. Following the same setting of the previous tables, Table 5 summarizes the simulated C I (ϑ) 0.95 obtained for binary trials with ρ PW and ρ R as ϑ and θ B vary. Bootstrap-based and likelihood-based CIs confirm their good performance with quite similar empirical coverage; bootstrap intervals are on average slightly narrower and right-shifted. As previously discussed, the design-based CIs show an extremely unstable behavior, in particular when the targets approach 1 (i.e., as θ B grows for ρ PW or as θ B tends to 0 for ρ R ), also due to their dependence on the nuisance parameter.
While the empirical coverage (EC) for the CIs of ρ is always close to its nominal value, the inverse-mapping transformation can cause either undercoverage for ρ PW or overcoverage for ρ R in the CIs of ϑ. Table 6 displays the simulated C I (ϑ) 0.95 obtained for exponential and Poisson outcomes adopting ρ R and ρ Z as ϑ and θ B vary. Bootstrap-based and likelihood-based CIs perform fairly well, while the design-based CIs are, on average, slightly wider.
Finally, it is worth highlighting that our proposal exhibits good inferential performance also for small/medium sample sizes. In the same setting of the previous tables, Tables 7 and 8 summarize the results for the simulated power and C I (ϑ) 0.95 with n = 100, adopting ρ R . We set θ B = 0.1 for binary data, while for homoscedastic normal, exponential and Poisson responses θ B = 1. Note that the sample size is now reduced to 40% of that of the previous tables; this clearly translates into lower power and wider confidence intervals. Nevertheless, T * n confirms its consistency while preserving the type-I error for all the considered models; moreover, the bootstrap-based CIs maintain their reliability in terms of both empirical coverage and interval width.

Discussion
In this paper, we propose a new inferential strategy for response-adaptive clinical trials based on the variance-stabilized bootstrap-t method. This is motivated by the fact that the available inferential approaches present several drawbacks, such as (i) inconsistency of Wald's test, locally decreasing power and unreliable CIs for likelihood inference, (ii) reduction in the empirical coverage of CIs and inflated type-I errors for the design-based approach, and (iii) unsuitability of randomization-based inference for general hypothesis testing problems. We derive the theoretical properties of the suggested methodology, showing that the degeneracy of the Fisher information is avoided, while guaranteeing the consistency of the test as well as a monotonically increasing power function. In general, this proposal preserves the nominal type-I error, attenuates the dependence on the nuisance parameters and is more efficient than the other methods, regardless of the chosen RA rule as well as the adopted target and its ethical skew. By means of an extensive simulation study, we show that the new inferential strategy has very good performance in terms of power compared to the above-mentioned inferential approaches. In addition, the suggested bootstrap approach provides reliable confidence intervals in terms of both empirical coverage and interval width, thus avoiding the possible degeneracies and instability of the likelihood-based and design-based approaches, respectively. Moreover, our proposal exhibits good inferential accuracy even for small/medium sample sizes. Although in actual practice the large majority of phase-III trials are planned to compare K = 2 treatments, the case K > 2 may also be of interest, and we now briefly discuss a possible extension of the proposed methodology.
Even if for two treatments the variance stabilizing transformation g is guaranteed to exist (whether or not its closed-form expression is available depends on the variance function and the chosen target), for several treatments this transformation does not exist in general. Let θ = (θ 1 , . . . , θ K )ᵗ and v = (v 1 , . . . , v K )ᵗ be the vectors of treatment effects and variances, respectively, while ρ(θ) = (ρ 1 (θ), . . . , ρ K (θ))ᵗ now denotes the target, namely ρ k (θ) is the target allocation of the kth treatment group (k = 1, . . . , K ) with 1 K ᵗ ρ(θ) = 1 (where 1 K is the K-dim vector of ones). In this setting, the inferential focus is on the contrasts ϑ = Aᵗθ where, considering without loss of generality the first treatment as the reference one, Aᵗ = [1 K−1 | −I K−1 ] (here I K−1 is the (K − 1)-dim identity matrix). After n steps, letting θ̂ n = (θ̂ n1 , . . . , θ̂ nK )ᵗ be the MLE of θ, if condition (1) holds for every treatment group, then θ̂ n is strongly consistent and asymptotically normal with √n (θ̂ n − θ) →d N(0 K , M⁻¹), where 0 K is the K-dim vector of zeros and M = diag(ρ k (θ)/v k ) k=1,...,K . Therefore, the MLE ϑ̂ n = Aᵗθ̂ n is strongly consistent and asymptotically normal with √n (ϑ̂ n − ϑ) →d N(0 K−1 , AᵗM⁻¹A). From the multi-dimensional Delta-method, the problem now consists in finding a proper covariance stabilizing transformation, namely a function G : R K−1 → R K−1 stabilizing AᵗM⁻¹A, i.e., such that √n (G(ϑ̂ n ) − G(ϑ)) →d N(0 K−1 , I K−1 ). By letting J = (∂G/∂x) x=ϑ be the Jacobian matrix of the partial derivatives of G evaluated at ϑ, G is a covariance stabilizing transformation if and only if JᵗJ = (AᵗM⁻¹A)⁻¹. Essentially, this corresponds to J = (AᵗM⁻¹A)⁻¹ᐟ², namely a mapping G whose Jacobian equals a square root of the symmetric and positive definite matrix (AᵗM⁻¹A)⁻¹ must be identified.
Unfortunately, this transformation may not exist, and its existence should be checked for any chosen model and target by solving the corresponding matrix differential equations; however, the computational complexity grows extremely fast as K increases, leading to very complicated computations except for K = 3, as discussed in Holland (1973).
Funding Open access funding provided by Alma Mater Studiorum-Università di Bologna within the CRUI-CARE Agreement.
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.