A simple solution to the inadequacy of asymptotic likelihood-based inference for response-adaptive clinical trials

The present paper discusses drawbacks and limitations of likelihood-based inference in sequential clinical trials for treatment comparisons managed via Response-Adaptive Randomization. Taking into account the most common statistical models for the primary outcome—namely binary, Poisson, exponential and normal data—we derive the conditions under which (i) the classical confidence intervals degenerate and (ii) the Wald test becomes inconsistent and strongly affected by the nuisance parameters, also displaying a non monotonic power. To overcome these drawbacks, we provide a very simple solution that could preserve the fundamental properties of likelihood-based inference. Several illustrative examples and simulation studies are presented in order to confirm the relevance of our results and provide some practical recommendations.

1 Introduction
[...] government agencies and Health Authorities (CHMP 2007; FDA 2018). RAR procedures are sequential allocation rules in which the allocation probabilities change on the basis of earlier responses and past assignments; the aim is to balance the experimental goals of drawing correct inferential conclusions and caring about the welfare of each patient, the so-called individual-versus-collective ethics dilemma (for a recent review, see Hu and Rosenberger 2006; Atkinson and Biswas 2014; Baldi and Giovagnoli 2015; Rosenberger and Lachin 2015). A cornerstone example is the randomized Play-the-Winner (PW) rule suggested for binary trials (see, e.g., Wei and Durham 1978; Ivanova 2003). The peculiarity of the PW rule is that the allocation proportion of each of the two treatments converges to the relative risk of the other, so that (asymptotically) the majority of patients will receive the best treatment. Another example, for normal and survival outcomes, is the treatment effect mapping (Rosenberger 1993), where the assignments are based on a function that links the difference between the treatment effects to the ethical skew of the allocation probability (Rosenberger and Seshaiyer 1997; Bandyopadhyay and Biswas 2001; Atkinson and Biswas 2005b).
Since the statistical objective of drawing correct inferential conclusions about the identification of the best treatment and its relative superiority often conflicts with the ethical aim of maximizing the subjects' care, some authors formalize these goals into suitable combined/constrained optimization problems (see, e.g., Rosenberger et al. 2001; Baldi Antognini and Giovagnoli 2010). The ensuing optimal allocations, usually referred to as targets, depend in general on the unknown treatment effects; although they are a priori unknown (the so-called local optimality problem), they can be approached by RAR procedures that sequentially estimate the model parameters in order to progressively converge to the chosen target. Classical examples are the Efficient Randomized Adaptive DEsign (ERADE) proposed by Hu et al. (2009) and the doubly-adaptive biased coin design (Hu and Zhang 2004). From a different perspective, the same trade-off between ethics and inference represents a special case of the so-called exploration-versus-exploitation dilemma in the Bayesian literature on bandit problems, where at each step an agent wants to simultaneously acquire new knowledge and optimize his/her decisions based on existing information (for a review, see Villar et al. 2015a, b).
Although the adaptation process induces a complex dependence structure, several authors provide the conditions under which the classical asymptotic likelihood-based inference is still valid for RAR procedures (see, e.g., Durham et al. 1997; Melfi and Page 2000; Baldi Antognini and Giovagnoli 2005). The crucial condition concerns the limiting allocation proportion induced by the chosen RAR rule, which should be a non-random quantity different from 0 and 1. Excluding some extremely ethical procedures, such as the randomly reinforced urn designs (May and Flournoy 2009), this condition is generally satisfied by the existing RAR rules and therefore the usual asymptotic properties of the MLEs are preserved; indeed, the large majority of the literature has focused on asymptotic likelihood-based inference, where the Wald test is the cornerstone (Rosenberger and Sriram 1996; Rosenberger et al. 1997; Melfi et al. 2001; Hu and Zhang 2004; Atkinson and Biswas 2005a, b; Geraldes et al. 2006; Tymofyeyev et al. 2007; Azriel et al. 2012). Under RAR procedures, Yi and Li (2018) theoretically prove that the Wald statistic is first-order efficient, while Yi and Wang (2011) show via simulations that, although asymptotically equivalent to the likelihood ratio and score tests, it performs better in small samples. However, several simulation studies show that, in some circumstances, this approach presents anomalies in terms of coverage probabilities of confidence intervals, as well as inflated type-I errors (see, e.g., Rosenberger and Hu 1999; Yi and Wang 2011; Atkinson and Biswas 2014; Baldi Antognini et al. 2018), especially for targets with a strong ethical component.
The aim of this paper is to demonstrate the inadequacy of asymptotic likelihood-based inference for RAR procedures, in terms of both confidence intervals and hypothesis testing. We stress the crucial role played by the chosen target, the variance function of the statistical model and the presence of nuisance parameters, which could (i) compromise the quality of the Central Limit Theorem (CLT) approximation of the standard MLEs and (ii) lead to a vanishing Fisher information. In particular, these degeneracies could happen when the variance function is unbounded or when the target allocations approach either 0 or 1 (which depends on both the chosen ethical component and the relative superiority of a given treatment wrt the other); we also show how the functional form of the target could induce a non-monotonic power function. We prove that the Wald test could become inconsistent, that it may display a strong dependence on the nuisance parameters, and that the standard confidence intervals could degenerate.
Since the common approach of the practitioners consists in superimposing a minimum percentage of allocations to each treatment, we demonstrate that by re-scaling the target some of these drawbacks could be circumvented. We show how a suitable choice of the threshold can be matched with a strong ethical skew of the target without compromising the inferential precision. Several illustrative examples are provided for normal, binary, Poisson and exponential data and simulation studies are performed in order to confirm the relevance of our results.
The paper is structured as follows. Starting from the notation and some preliminaries in Sect. 2, Sect. 3 deals with likelihood-based inference, highlighting its inadequacy for RAR procedures in Sect. 4, with several examples showing the practical implication of the above-mentioned drawbacks. Section 5 discusses our proposal of re-scaling the target and its properties and Sect. 6 deals with some concluding remarks.

Notation and model
Suppose that statistical units enter the trial sequentially and are assigned to one of two competing treatments, say $A$ and $B$. At each step $i \geq 1$, let $\delta_i$ be the indicator managing the allocation of the $i$th subject, namely $\delta_i = 1$ if he/she is assigned to $A$ and $0$ otherwise. Given the treatment assignments, the observed outcomes $Y$s relative to either treatment are assumed to be independent and identically distributed, belonging to the natural exponential family with quadratic variance function, $Y \sim NQ(\theta; v(\theta))$, where $\theta \in \Theta \subseteq \mathbb{R}$ denotes the mean and the variance $v = v(\theta) > 0$ is at most a quadratic function of the mean (Morris 1982). In this setting, $\boldsymbol{\theta} = (\theta_A; \theta_B)^t$ denotes the treatment effects and from now on we let $\overline{\Theta} = \sup \Theta$ and $\underline{\Theta} = \inf \Theta$. Special cases of particular relevance for applications are the Bernoulli distribution (with $\theta_j \in (0; 1)$ and $v(\theta_j) = \theta_j(1 - \theta_j)$) for binary outcomes, the Poisson model ($\theta_j \in \mathbb{R}^+$ and $v(\theta_j) = \theta_j$) for count data, the exponential distribution ($\theta_j \in \mathbb{R}^+$ and $v(\theta_j) = \theta_j^2$) for survival outcomes, while the normal homoscedastic model is also encompassed for continuous responses (where $\theta_j \in \mathbb{R}$ and $v(\theta_j) = v \in \mathbb{R}^+$ is the common nuisance parameter). In this setting, the treatment outcomes are stochastically ordered on the basis of their effects and from now on (without loss of generality) we assume that high responses are preferable. As is well known, the $NQ$ class contains two further basic models, namely the negative binomial and the generalized hyperbolic secant distributions, which however may be less appealing for practical applications, especially in the clinical context.
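The variance functions of the four models used throughout can be collected in a single helper. The following Python sketch is only illustrative; the function name `variance` and the model labels are hypothetical choices, not notation from the paper.

```python
def variance(model: str, theta: float, v: float = 1.0) -> float:
    """Variance function v(theta) for the four models discussed in the text.

    `model` labels are illustrative; `v` is the common (nuisance) variance
    of the homoscedastic normal model and is ignored otherwise.
    """
    if model == "bernoulli":      # theta in (0, 1)
        return theta * (1.0 - theta)
    if model == "poisson":        # theta > 0
        return theta
    if model == "exponential":    # theta > 0 is the mean
        return theta ** 2
    if model == "normal":         # variance does not depend on the mean
        return v
    raise ValueError(f"unknown model: {model}")
```

Only the Bernoulli variance is bounded on its parameter space; the Poisson and exponential variances grow without bound as the mean increases, which is the source of some of the degeneracies discussed later.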
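After $n$ steps, let $N_{An} = \sum_{i=1}^{n} \delta_i$ and $N_{Bn} = n - N_{An}$ be the numbers of assignments to the two treatments, so that $\pi_n = n^{-1} N_{An}$ is the allocation proportion to $A$ (respectively, $1 - \pi_n$ to $B$). Then, the MLEs of the treatment effects coincide with the sample means, $\hat{\theta}_{An} = N_{An}^{-1} \sum_{i=1}^{n} \delta_i Y_i$ and $\hat{\theta}_{Bn} = N_{Bn}^{-1} \sum_{i=1}^{n} (1 - \delta_i) Y_i$ (Baldi and Giovagnoli 2015).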

Target allocations and RAR rules
Motivated by ethical demands, Response-Adaptive procedures have been proposed with the aim of skewing the assignments towards the treatment that appears to be superior or, more generally, of converging to suitable limiting allocation proportions, say $\rho = \rho(\boldsymbol{\theta}) \in (0; 1)$ to $A$ (and $1 - \rho$ to $B$, respectively), namely ideal allocations of the treatments representing a valid trade-off between ethics and inference.
In the context of binary trials, a classical example is the PW rule (Zelen 1969), under which a success on a given treatment leads to assigning the same treatment to the next unit, while a failure implies switching to the competitor. Under this procedure, the allocation proportion of treatment $A$ converges to
$$\rho_{PW}(\theta_A; \theta_B) = \frac{1 - \theta_B}{2 - \theta_A - \theta_B}, \tag{1}$$
which is also the limiting allocation of the randomized PW (Wei and Durham 1978) and of the Drop-the-Loser rule (Ivanova 2003). Differently, for normal homoscedastic trials Bandyopadhyay and Biswas (2001) and Atkinson and Biswas (2005b) suggested RAR procedures targeting
$$\rho_{N}(\theta_A; \theta_B) = \Phi\left(\frac{\theta_A - \theta_B}{T}\right), \tag{2}$$
where $\Phi$ is the cumulative distribution function (cdf) of the standard normal and $T > 0$ a tuning parameter. Although $\rho_{PW}$ and $\rho_N$ are considered ethical targets, as the majority of subjects are assigned to the best treatment, they do not have a formal mathematical justification. On the other hand, by expressing ethical aims and inferential goals as suitable design criteria, several authors provided optimal allocations via combined/constrained optimization problems. An example for binary trials is the target proposed by Rosenberger et al. (2001) and further generalized by Tymofyeyev et al. (2007), namely [...], which is aimed at minimizing the expected number of failures for a given variance of the estimated treatment difference, while [...] corresponds to the so-called A- and E-optimal design for exponential and Poisson data, respectively (Baldi and Giovagnoli 2015). Clearly, these targets also encompass normal homoscedastic data provided that the treatment effects are positive (Zhang and Rosenberger 2006). In order to favour the best treatment, the targets should depend on a suitable discrepancy measure between the unknown treatment effects (e.g., the treatment difference in $\rho_N$, the ratio between the effects for $\rho_R$ or the relative risk in $\rho_{PW}$), so that the target function $\rho$ links the relative superiority of a given treatment to the ethical skewness of the allocations.
Moreover, as for (2), the targets could also depend on a non-negative constant $T$, chosen by the experimenter, that governs their ethical skew (i.e., for low values of $T$ the target tends to strongly skew the assignments to the best treatment, while as $T$ grows the ethical component vanishes and $\rho$ tends to balance the allocations). Therefore, common assumptions are:
A1: $\rho$ is a continuous function invariant under label permutation of the treatments, namely $\rho(\theta_A; \theta_B) = 1 - \rho(\theta_B; \theta_A)$,
A2: $\rho$ is increasing in $\theta_A$ and decreasing in $\theta_B$,
ensuring that (i) both treatments are handled symmetrically and (ii) the best treatment is increasingly favoured as its relative superiority grows.
Remark 1 Note that, depending on the underlying statistical model, the well-known Neyman allocation $\rho(\boldsymbol{\theta}) = \sqrt{v(\theta_A)} / [\sqrt{v(\theta_A)} + \sqrt{v(\theta_B)}]$, namely the A-optimal design, may not have any ethical appeal, since the majority of patients could be assigned to the worst treatment. Indeed, for binary and normal outcomes it does not satisfy assumption A2, while for Poisson and exponential data the Neyman target is ethical and corresponds to $\rho_Z$ and $\rho_R$, respectively.
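The targets and assumptions above lend themselves to a quick numerical check. The sketch below is a minimal illustration, assuming the standard closed forms from the cited literature: the PW limit $(1-\theta_B)/(2-\theta_A-\theta_B)$, the normal target $\Phi((\theta_A-\theta_B)/T)$ and the Neyman allocation $\sqrt{v_A}/(\sqrt{v_A}+\sqrt{v_B})$. All function names are hypothetical.

```python
from math import sqrt
from statistics import NormalDist

_PHI = NormalDist().cdf  # standard normal cdf


def rho_pw(theta_a: float, theta_b: float) -> float:
    """Assumed PW limiting allocation to A (thetas are success probabilities)."""
    return (1.0 - theta_b) / (2.0 - theta_a - theta_b)


def rho_n(theta_a: float, theta_b: float, T: float = 1.0) -> float:
    """Assumed normal target with tuning constant T > 0."""
    return _PHI((theta_a - theta_b) / T)


def rho_neyman(v_a: float, v_b: float) -> float:
    """Neyman allocation to A given the two outcome variances."""
    return sqrt(v_a) / (sqrt(v_a) + sqrt(v_b))


def satisfies_a1(rho, theta_a: float, theta_b: float, tol: float = 1e-12) -> bool:
    """Pointwise check of A1: rho(a, b) = 1 - rho(b, a)."""
    return abs(rho(theta_a, theta_b) - (1.0 - rho(theta_b, theta_a))) < tol
```

For binary outcomes with $\theta_A = 0.9$ and $\theta_B = 0.5$, `rho_neyman(0.9 * 0.1, 0.5 * 0.5)` is below 1/2 although $A$ is the better treatment, which is the lack of ethical appeal noted in Remark 1.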
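Given a desired $\rho$, RAR rules based on sequential estimation could be employed to converge to it. After a starting sample of $n_0$ subjects assigned to each treatment to derive non-trivial estimates of the unknown parameters, at each step $n > 2n_0$ the treatment effects are estimated by means of $\hat{\boldsymbol{\theta}}_n = (\hat{\theta}_{An}; \hat{\theta}_{Bn})^t$ and the target is estimated accordingly by $\rho(\hat{\boldsymbol{\theta}}_n)$, so that the next assignment is randomized in order to drive the allocation proportion towards $\rho$. For instance, ERADE (Hu et al. 2009) randomizes the allocations by
$$\Pr(\delta_{n+1} = 1 \mid \delta_1, \ldots, \delta_n; Y_1, \ldots, Y_n) = \begin{cases} \gamma \rho(\hat{\boldsymbol{\theta}}_n), & \text{if } \pi_n > \rho(\hat{\boldsymbol{\theta}}_n), \\ \rho(\hat{\boldsymbol{\theta}}_n), & \text{if } \pi_n = \rho(\hat{\boldsymbol{\theta}}_n), \\ 1 - \gamma [1 - \rho(\hat{\boldsymbol{\theta}}_n)], & \text{if } \pi_n < \rho(\hat{\boldsymbol{\theta}}_n), \end{cases}$$
where $\gamma \in [0; 1)$ is the randomization parameter of the allocation process.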
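The ERADE allocation step can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions: the three-branch probability follows the usual definition of ERADE in Hu et al. (2009), and the simulation treats the target as known rather than re-estimating it from responses at each step. The names `erade_prob` and `simulate_allocation` are hypothetical.

```python
import random


def erade_prob(pi_n: float, rho_hat: float, gamma: float = 0.5) -> float:
    """Probability of assigning the next subject to A under ERADE."""
    if pi_n > rho_hat:            # A is over-represented: damp its probability
        return gamma * rho_hat
    if pi_n < rho_hat:            # A is under-represented: boost its probability
        return 1.0 - gamma * (1.0 - rho_hat)
    return rho_hat


def simulate_allocation(rho: float, n: int = 2000, n0: int = 2,
                        gamma: float = 0.5, seed: int = 0) -> float:
    """Allocation proportion to A after n steps for a KNOWN target rho.

    Simplification: a real trial would replace `rho` by its current
    estimate rho(theta_hat_n) computed from the observed responses.
    """
    rng = random.Random(seed)
    n_a, total = n0, 2 * n0       # n0 initial subjects on each arm
    while total < n:
        p = erade_prob(n_a / total, rho, gamma)
        n_a += rng.random() < p   # True/False adds as 1/0
        total += 1
    return n_a / total
```

With a fixed target the allocation proportion settles close to `rho`, since the rule always pushes $\pi_n$ back towards it.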

Asymptotic likelihood-based inference for RAR procedures
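Assuming that the inferential goal consists in estimating/testing the superiority of a given treatment with respect to the gold standard (say $A$ wrt $B$), the parameter of interest is the treatment difference $\Delta = \theta_A - \theta_B$, while $\theta_B$ is usually regarded as a nuisance parameter (namely, $\theta_B$ is a common baseline while $\Delta$ represents the additive effect of the relative superiority/inferiority of $A$ over $B$). Although the MLEs are the same as in the non-sequential setting, this is not true for their distribution because of the complex dependence structure generated by the adaptation process. However, if the RAR design is chosen so that
C1: $\lim_{n \to \infty} \pi_n = \rho(\boldsymbol{\theta})$ a.s.,
with $\rho(\boldsymbol{\theta})$ satisfying assumptions A1-A2, then the standard asymptotic inference is allowed. Indeed, the MLEs are still consistent and asymptotically normal, with $\sqrt{n}(\hat{\Delta}_n - \Delta) \to N(0; \sigma^2_\rho)$ in distribution, where $\hat{\Delta}_n = \hat{\theta}_{An} - \hat{\theta}_{Bn}$ and
$$\sigma^2_\rho = \frac{v(\theta_A)}{\rho(\boldsymbol{\theta})} + \frac{v(\theta_B)}{1 - \rho(\boldsymbol{\theta})}, \tag{3}$$
and, due to the continuity of the target, $\lim_{n \to \infty} \rho(\hat{\boldsymbol{\theta}}_n) = \rho(\boldsymbol{\theta})$ a.s. Letting $\hat{v}_{jn}$ ($j = A, B$) be consistent estimators of the treatment variances, then $\hat{\sigma}^2_n = \hat{v}_{An}/\rho(\hat{\boldsymbol{\theta}}_n) + \hat{v}_{Bn}/[1 - \rho(\hat{\boldsymbol{\theta}}_n)]$ is a consistent estimator of $\sigma^2_\rho$ and the $100(1 - \alpha)\%$ asymptotic confidence interval is
$$CI(\Delta)_{1-\alpha} = \left( \hat{\Delta}_n \pm z_{1-\alpha/2} \frac{\hat{\sigma}_n}{\sqrt{n}} \right), \tag{4}$$
where $z_\alpha$ is the $\alpha$-percentile of $\Phi$.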
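As a numerical companion to the confidence interval just described, the following sketch computes the asymptotic variance and the interval endpoints. It assumes the form $\sigma^2_\rho = v(\theta_A)/\rho + v(\theta_B)/(1-\rho)$, which mirrors the plug-in estimator discussed in Remark 4; the function names are hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def sigma2_rho(v_a: float, v_b: float, rho: float) -> float:
    """Assumed asymptotic variance of sqrt(n) * (Delta_hat - Delta)."""
    return v_a / rho + v_b / (1.0 - rho)


def wald_ci(delta_hat: float, v_a_hat: float, v_b_hat: float,
            rho_hat: float, n: int, alpha: float = 0.05) -> tuple:
    """Two-sided asymptotic confidence interval for Delta."""
    z = NormalDist().inv_cdf(1.0 - alpha / 2.0)
    half_width = z * sqrt(sigma2_rho(v_a_hat, v_b_hat, rho_hat) / n)
    return delta_hat - half_width, delta_hat + half_width
```

Under balance (`rho = 0.5`) with unit variances the variance equals 4; it grows without bound as `rho` approaches 0 or 1, which is the mechanism behind the degeneracies studied in the following sections.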
As regards hypothesis testing, the inferential aim typically lies in testing $H_0: \Delta = 0$ against $H_1: \Delta > 0$ (or $H_1: \Delta \neq 0$). The asymptotic test is usually performed via the Wald statistic $W_n = \sqrt{n} \hat{\Delta}_n \hat{\sigma}_n^{-1}$ which, under $H_0$, converges to the standard normal distribution. Thus, given the alternative $H_1: \Delta > 0$, the power of the $\alpha$-level Wald test can be approximated by
$$\Phi\left( \frac{\sqrt{n} \Delta}{\sigma_\rho} - z_{1-\alpha} \right), \tag{5}$$
due to the consistency of $\hat{\sigma}^2_n$. As stated by several authors (Lehmann 1999; Hu and Rosenberger 2006; Tymofyeyev et al. 2007), this approximation is accurate and particularly effective in the moderate-to-large sample setting of phase-III trials; it is therefore suited neither to early-phase studies with small sample sizes nor to the asymptotic regime $n \to \infty$, where different approaches providing a proper local approximation of the power around specific values of $\Delta$ (e.g., the local alternative framework) could be more suitable.
Even if less interesting in actual practice, the two-sided alternative $H_1: \Delta \neq 0$ can be encompassed analogously. Under $H_0$, $W_n^2$ converges in distribution to a central chi-square $\chi^2_1$ with 1 degree of freedom, while under $H_1$, $W_n^2$ could be approximated by a non-central $\chi^2_1$ with non-centrality parameter $n \Delta^2 \sigma_\rho^{-2}$, namely the square of the crucial quantity in (5). As is well known, the power is an increasing function of the non-centrality parameter and it is maximized by the Neyman allocation, which also minimizes (3).
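The one-sided power approximation can be written as a single function. This is a sketch assuming the normal approximation $\Phi(\sqrt{n}\,\Delta/\sigma_\rho - z_{1-\alpha})$ for the test of $H_0: \Delta = 0$ against $H_1: \Delta > 0$; the name `wald_power` is hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def wald_power(delta: float, sigma_rho: float, n: int,
               alpha: float = 0.05) -> float:
    """Approximate power of the one-sided level-alpha Wald test."""
    nd = NormalDist()
    z = nd.inv_cdf(1.0 - alpha)            # critical value z_{1-alpha}
    return nd.cdf(sqrt(n) * delta / sigma_rho - z)
```

At `delta = 0` the function returns the significance level, and for fixed `delta` it increases with `n` and decreases with `sigma_rho`; any target that inflates `sigma_rho` faster than `delta` grows therefore pulls the power back towards `alpha`.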

Inadequacy of likelihood-based inference
Note that condition C1 avoids the extreme scenarios $\rho = 0$ or $1$; however, most of the targets suggested in the literature satisfy the following property:
$$\lim_{\theta_A \to \overline{\Theta}} \rho(\theta_A; \theta_B) = 1 \quad \text{or} \quad \lim_{\theta_A \to \underline{\Theta}} \rho(\theta_A; \theta_B) = 0, \quad \text{for every } \theta_B \in \Theta. \tag{6}$$
It is worth stressing that, even if the symmetry assumption A1 holds, $\rho \to 1$ as $\theta_A \to \overline{\Theta}$ does not imply that $\rho \to 0$ as $\theta_A \to \underline{\Theta}$ and vice versa (see, e.g., $\rho_{PW}$ in (1)). If $\rho$ satisfies (6) or if the variance function of the statistical model is unbounded, then the asymptotic variance $\sigma^2_\rho$ tends to diverge and the quality of the CLT approximation could be damaged, thus compromising any likelihood-based inferential procedure. This translates into both (i) unreliable asymptotic confidence intervals and (ii) anomalous behaviour of the power of the Wald test.

Confidence Intervals
The following Theorem shows the drawbacks of the asymptotic likelihood-based confidence intervals, that could degenerate not only for statistical models with unbounded variance, but also when the chosen target is characterized by a strong ethical component, i.e., if ρ satisfies (6).

Theorem 1
The asymptotic variance $\sigma^2_\rho$ and the width of the asymptotic $CI(\Delta)_{1-\alpha}$ diverge if the variance function is unbounded, i.e. when $\overline{\Theta} = \infty$ and $\lim_{\theta \to \overline{\Theta}} v(\theta) = \infty$, or if $\rho$ is chosen so that [...]. In particular, for exponential and Poisson data, the width of $CI(\Delta)_{1-\alpha}$ diverges as $\Delta$ grows regardless of the chosen target, while for normal homoscedastic outcomes, the asymptotic CI degenerates for every target satisfying (6). As regards binary trials, [...]
Proof The proof follows directly from (3) by noticing that the condition $\lim_{\theta_A \to \underline{\Theta}} \rho(\theta_A; \theta_B) = 0$ for every $\theta_B \in \Theta$ is only necessary but not sufficient, since the variance function could vanish as $\theta_A \to \underline{\Theta}$. For normal homoscedastic, exponential and Poisson data the proof is straightforward. For binary trials, under the PW target, the asymp[...]
The divergence of the asymptotic CIs strongly depends on the speed of convergence of the target to 0 or 1. For instance, taking into account $\rho_N$ in (2), this can be severely accentuated by the effect of the tuning constant, since $T$ induces a scaling effect by contracting/expanding the treatment difference $\Delta$ (for $T > 1$ or $T < 1$, respectively). Thus, small choices of $T$ may deteriorate the quality of the CLT approximation as well as accelerate the divergence of the asymptotic variance $\sigma^2_\rho$, even for values of $\theta_A$ close to $\theta_B$ (i.e., for values of $\Delta$ close to 0), and not only as $\theta_A$ tends to either $\overline{\Theta}$ or $\underline{\Theta}$.
Example 1 In order to stress how small values of $T$ could severely undermine the precision of likelihood-based inferential procedures, we perform a simulation study with 100,000 normal homoscedastic trials ($v = 1$) by employing ERADE ($\gamma = 0.5$) with $n = 250$. Taking into account $\rho_N$, Fig. 1 shows the simulated distributions of the MLE $\hat{\Delta}_n$ as $\Delta$ and $T$ vary, while Table 1 summarizes the behaviour of the simulated 95% asymptotic confidence intervals for $\Delta$, where Lower (L) and Upper (U) bounds are obtained by averaging the endpoints of the simulated trials (within brackets the corresponding theoretical values derived by (4)).
When $\Delta = 0$, low values of $T$ severely damage the CLT approximation, leading to a non-negligible increase of the density in the tails; whereas, for $\Delta > 0$, the distribution of $\hat{\Delta}_n$ presents a positive skewness, regardless of the value of $T$.
For $T \geq 1$, analytical and simulated confidence bounds are quite close; however, as $\Delta$ grows, the skewness affects the quality of the CLT approximation. Regardless of $\Delta$, small values of $T$ severely damage the accuracy of the $CI(\Delta)_{0.95}$, whose width tends to diverge extremely fast. The empirical coverage confirms the above-mentioned behaviour and tends to 1 as the width of the intervals grows. Moreover, as shown by many authors (see, e.g., Coad and Woodroofe 1998), although asymptotically unbiased, the MLEs under RAR procedures are biased for finite samples. Even for $n = 250$, $\hat{\Delta}_n$ tends to overestimate $\Delta$ for positive values of the treatment difference and this effect is exacerbated for low values of $T$.
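The role of $T$ in the theoretical width of the interval can be reproduced without simulation. The sketch below evaluates the asymptotic width at the true parameter values, assuming $\rho_N = \Phi(\Delta/T)$ and $\sigma^2_\rho = v/\rho + v/(1-\rho)$ for the homoscedastic normal model; it is not the simulation of Example 1, and the function name is hypothetical.

```python
from math import erfc, sqrt
from statistics import NormalDist


def ci_width_rho_n(delta: float, T: float, n: int = 250,
                   v: float = 1.0, alpha: float = 0.05) -> float:
    """Theoretical width of the asymptotic CI under the assumed target rho_N."""
    # 1 - rho_N = Phi(-delta / T), computed via erfc to avoid cancellation
    one_minus_rho = 0.5 * erfc(delta / (T * sqrt(2.0)))
    rho = 1.0 - one_minus_rho
    sigma2 = v / rho + v / one_minus_rho
    z = NormalDist().inv_cdf(1.0 - alpha / 2.0)
    return 2.0 * z * sqrt(sigma2 / n)
```

For a fixed positive `delta`, lowering `T` pushes $\rho_N$ towards 1 and inflates the width sharply, in line with the qualitative behaviour described for Table 1.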

Hypothesis Testing
Taking now into account hypothesis testing, for every fixed value of the nuisance parameter $\theta_B \in \Theta$ (and $v \in \mathbb{R}^+$ for normal homoscedastic data), the power function (5) is governed by the non-negative function
$$t_\rho(\Delta) = \frac{\Delta}{\sigma_\rho} = \Delta \left[ \frac{v(\theta_B + \Delta)}{\rho(\theta_B + \Delta; \theta_B)} + \frac{v(\theta_B)}{1 - \rho(\theta_B + \Delta; \theta_B)} \right]^{-1/2}. \tag{7}$$
Notice that the Wald test could present inflated type-I errors. Indeed, when $\theta_A = \theta_B$, from assumption A1, $\rho(\boldsymbol{\theta}) = 1 - \rho(\boldsymbol{\theta}) = 1/2$ and therefore $t_\rho(0) = 0$ for every $\theta_B \in \Theta$ regardless of the chosen target. Moreover, since in this case $\sigma_\rho = 2\sqrt{v(\theta_B)}$, inflated type-I errors could be present only if $v(\theta_B) \approx 0$. This is the reason why a slight inflation is detected in several simulation studies of both binary trials with low [...]
Under the alternative hypothesis, the power could exhibit anomalous behaviour, especially when $\rho$ has a strong ethical skew. In particular, we shall show that, for a given statistical model, some target allocations may induce a non-monotonic power (that could also degenerate as the difference between the treatment effects grows), making the Wald test not consistent. Indeed, for every size, if $t_\rho(\Delta)$ in (7) vanishes as $\Delta$ grows, from (5) the power tends to $\Phi(-z_{1-\alpha}) = \alpha$ (i.e., the significance level), as the following Theorem shows.
In particular, for binary trials the Wald test is consistent under ρ R , while it is not adopting ρ PW . Taking into account Poisson, exponential and normal homoscedastic models, ρ R guarantees the consistency of the Wald test, while ρ N induces the inconsistency of the test.
Proof Given a chosen target $\rho$, the Wald test is not consistent when $t_\rho(\Delta)$ in (7) vanishes as $\Delta$ grows. For $\overline{\Theta} < \infty$, from Theorem 1 this is satisfied iff $\lim_{\theta_A \to \overline{\Theta}} \rho(\theta_A; \theta_B) = 1$ for every $\theta_B \in \Theta$. For $\overline{\Theta} = \infty$, the same conclusion still holds provided that, as $\theta_A \to \infty$, $\sigma^2_\rho$ diverges faster than $\theta_A^2$. Since for the $NQ$ class the variance function $v(\cdot)$ is at most quadratic, this holds iff [...] For binary trials, assuming the PW target in (1) the power tends to $\alpha$ as $\Delta$ grows, since $\lim_{\theta_A \to \overline{\Theta}} \rho_{PW}(\theta_A; \theta_B) = 1$ for every $\theta_B \in (0; 1)$. Whereas, adopting $\rho_R$, $\lim_{\theta_A \to \overline{\Theta}} \rho_R(\theta_A; \theta_B) = (1 + \theta_B)^{-1} < 1$ for every $\theta_B \in (0; 1)$ and therefore the test is consistent. Taking into account Poisson, exponential and normal homoscedastic models, adopting $\rho_R$ the test is consistent since lim [...]
Remark 2 Although the condition $\lim_{\theta_A \to \overline{\Theta}} \rho(\theta_A; \theta_B) = 1$ is always necessary for the inconsistency of the Wald test, for binary trials it is also sufficient, making the PW rule unsuitable for likelihood-based inference. Excluding the binary case, in order to reliably apply the Wald statistic, $\rho$ should satisfy [...]
Remark 3 Although our approach complements the one of Yi and Li (2018), Theorems 1 and 2 clearly conflict with their results. In particular, the authors show that the Wald statistic achieves the upper bound of the asymptotic power and derive the rates of coverage error probability of the corresponding confidence intervals. Their results depend on the boundedness of the remainder term in the Taylor expansion of Lemma 1 in Yi and Li (2018), where the authors state that if $\rho \in (0; 1)$ then there exists $r \in (0; 1/2]$ such that $r \leq \rho \leq 1 - r$. However, this condition does not hold for targets satisfying (6) (for instance, there is no $r \in (0; 1/2]$ bounding $\rho_N$).
Example 2 To underline how the adoption of the PW target could severely undermine the reliability of the Wald test, we perform a simulation study with 100,000 binary trials by employing ERADE ($\gamma = 0.5$).
Figure 2 shows the simulated power as Δ varies for θ B = 0.7, 0.8 and 0.9 for different sample sizes. As theoretically proved, the power tends to the significance level α regardless of the sample size. Moreover, the power function is decreasing not only at θ A ≈ 1 but also for smaller and potentially crucial differences between the treatment effects, especially for small samples. For instance, when n = 100, for θ B = 0.9 the maximum power is about 25% attained at Δ = 0.07 (i.e., θ A = 0.97), while for θ B = 0.8 the power is always lower than 75% and rapidly decreases for Δ ≥ 0.16. Even with n = 250, the power does not reach 1 when θ B > 0.8; although such a degenerating behaviour is attenuated as the sample size increases, it still persists also for n = 400.
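The qualitative shape of these power curves can be reproduced from the asymptotic approximation alone. The sketch below assumes the PW target $(1-\theta_B)/(2-\theta_A-\theta_B)$, the Bernoulli variance function and the one-sided power approximation $\Phi(\sqrt{n}\,\Delta/\sigma_\rho - z_{1-\alpha})$ with $\alpha = 0.05$; it is a theoretical counterpart, not the simulation behind Fig. 2, and the function name is hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def pw_power(theta_a: float, theta_b: float, n: int,
             alpha: float = 0.05) -> float:
    """Approximate one-sided Wald power for binary data under the PW target."""
    denom = 2.0 - theta_a - theta_b
    rho = (1.0 - theta_b) / denom              # assumed share of A
    one_minus_rho = (1.0 - theta_a) / denom    # share of B, computed directly
    sigma2 = (theta_a * (1.0 - theta_a) / rho
              + theta_b * (1.0 - theta_b) / one_minus_rho)
    nd = NormalDist()
    t = (theta_a - theta_b) / sqrt(sigma2)
    return nd.cdf(sqrt(n) * t - nd.inv_cdf(1.0 - alpha))
```

With $\theta_B = 0.9$ and $n = 100$, the approximation is higher at $\theta_A = 0.97$ than at values of $\theta_A$ very close to 1, where it falls back towards $\alpha$: the share of patients on $B$ vanishes, so the variance of the estimate of $\theta_B$ explodes.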
An additional drawback of the PW target is related to its functional form. Indeed, although condition A2 is satisfied (namely, $\rho_{PW}$ is decreasing in $\theta_B$ and therefore $1 - \rho_{PW}$ is increasing in $\theta_B$), for any fixed difference $\Delta = \theta_A - \theta_B$, the allocation to $B$ is decreasing in $\theta_B$ as the following table shows. Indeed, the PW target could be rewritten as
$$\rho_{PW} = \frac{1 - \theta_B}{2(1 - \theta_B) - \Delta},$$
leading to a negative derivative wrt $\theta_B$ of $1 - \rho_{PW}$ (i.e., the target allocation of treatment $B$).
Besides consistency, an additional natural requirement of the test is that the power should be monotonically increasing in $\Delta$ (i.e., in $\theta_A$ for every $\theta_B \in \Theta$), in order to identify with high precision the best treatment as its relative superiority grows. From (7), provided that $\rho$ is differentiable, the power of the Wald test is increasing iff, for every $\theta_B \in \Theta$, [...] (8) where $f_x = \partial f / \partial x$ denotes the partial derivative of $f$ wrt $x$ (to avoid cumbersome notation, we shall omit the subscript for the derivative of scalar functions). In addition to the statistical model, condition (8) regards the chosen target and needs to be satisfied for every $\theta_A > \theta_B$, involving the entire functional form of $\rho$ (not only its limits and the speed of convergence to them as in Theorems 1 and 2). Clearly, if the target induces the inconsistency of the test, then (8) fails to hold; conversely, if $\rho$ guarantees the consistency of the test, this does not necessarily ensure the monotonicity of the power, as shown in Fig. 5. For instance, as also discussed by Baldi Antognini et al. (2018), for normal homoscedastic data $v' = 0$ and the power is increasing in $\Delta$ iff $\rho$ is chosen so that [...] (9). Clearly, this condition fails to hold for $\rho_N$, while it is satisfied by $\rho_R$. Analogously, for binary trials adopting $\rho_{PW}$ the power of the Wald test is not monotonically increasing. Indeed, condition (8) can be restated as [...] where, for every $\theta_B \in (0; 1)$, as $\theta_A$ tends to $\overline{\Theta} = 1$ the LHS tends to $-\infty$ while the RHS tends to $1/(1 - \theta_B) > 0$.
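To see where a monotonicity condition of this kind comes from, one can differentiate the quantity driving the power. The derivation below is a sketch under the assumption that the power is governed by $t_\rho = \Delta/\sigma_\rho$ with $\sigma^2_\rho = v(\theta_A)/\rho + v(\theta_B)/(1-\rho)$; the authors' condition (8) may be arranged differently, so this should be read as an equivalent form rather than as the original display.

```latex
\begin{align*}
\frac{\partial}{\partial \theta_A}\,\frac{\Delta}{\sigma_\rho}
  &= \frac{1}{\sigma_\rho}
   - \frac{\Delta}{2\sigma_\rho^{3}}\,
     \frac{\partial \sigma^2_\rho}{\partial \theta_A} > 0
  \;\iff\;
  2\sigma^2_\rho > \Delta\,\frac{\partial \sigma^2_\rho}{\partial \theta_A},\\[4pt]
\frac{\partial \sigma^2_\rho}{\partial \theta_A}
  &= \frac{v'(\theta_A)}{\rho}
   - \frac{v(\theta_A)\,\rho_{\theta_A}}{\rho^{2}}
   + \frac{v(\theta_B)\,\rho_{\theta_A}}{(1-\rho)^{2}},\\[4pt]
\text{normal homoscedastic } (v' = 0):\quad
  & 2\rho(1-\rho) > \Delta\,\rho_{\theta_A}\,(2\rho - 1).
\end{align*}
```

The last line follows by substituting $v(\theta_A) = v(\theta_B) = v$, so that $\sigma^2_\rho = v/[\rho(1-\rho)]$ and $\partial\sigma^2_\rho/\partial\theta_A = v\,\rho_{\theta_A}(2\rho-1)/[\rho(1-\rho)]^2$. It makes explicit that a target whose derivative is large while $\rho$ is already close to 1 can violate monotonicity.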

Proposition 1 For normal, binary, exponential and Poisson data, ρ R always guarantees that the power of the Wald test is monotonically increasing in Δ.
Proof For the normal homoscedastic model, inequality (9) is trivially satisfied since [...] For Poisson and exponential data, condition (8) still holds since, for every $\theta_B \in \mathbb{R}^+$, [...] In the context of binary trials, inequality (8) becomes [...]
As previously discussed, $\rho_R$ is able to preserve the fundamental properties of the Wald test, namely the consistency and the monotonicity of its power. However, this target strongly depends on the nuisance parameter $\theta_B$; indeed, for a fixed difference $\Delta$, as $\theta_B$ grows $\rho_R(\theta_A; \theta_B) \to 1/2$ and, therefore, its ethical improvement tends to vanish as well as the induced power. For instance, from (7), under exponential outcomes [...] and both of them vanish as $\theta_B$ grows, for every fixed $\theta_A$. Figure 3 confirms graphically the crucial role played by $\theta_B$ in terms of power: given a difference $\Delta = 0.5$, under the exponential model the power decreases from 0.94 to 0.10 as $\theta_B$ grows from 1 to 10 (while for Poisson data it goes from 0.97 to 0.34).
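The dependence of the power on the nuisance parameter can be illustrated numerically. The sketch below rests on several assumptions that are not stated in this passage: $\rho_R$ is taken in the ratio form $\theta_A/(\theta_A + \theta_B)$ (consistent with the limit $(1+\theta_B)^{-1}$ quoted in the proof of Theorem 2), the sample size is set to $n = 250$, and the test is one-sided with $\alpha = 0.05$. The function name is hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def power_ratio_target(theta_b: float, delta: float, model: str,
                       n: int = 250, alpha: float = 0.05) -> float:
    """Approximate Wald power under the assumed ratio target theta_A/(theta_A+theta_B)."""
    theta_a = theta_b + delta
    rho = theta_a / (theta_a + theta_b)        # assumption, see lead-in
    if model == "exponential":
        v_a, v_b = theta_a ** 2, theta_b ** 2
    elif model == "poisson":
        v_a, v_b = theta_a, theta_b
    else:
        raise ValueError(f"unknown model: {model}")
    sigma = sqrt(v_a / rho + v_b / (1.0 - rho))
    nd = NormalDist()
    return nd.cdf(sqrt(n) * delta / sigma - nd.inv_cdf(1.0 - alpha))
```

Under these assumptions the approximation with $\Delta = 0.5$ drops markedly as $\theta_B$ moves from 1 to 10 for both models, in close agreement with the values quoted for Figure 3.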

A possible solution for likelihood-based inference: the re-scaled target
From Theorems 1 and 2, it is quite evident that some anomalous behaviours could be prevented by assuming a target that is not characterized by a strong ethical component, namely under which (6) fails to hold. Indeed, if the target is chosen so that 0 < l 1 ≤ ρ(θ) ≤ l 2 < 1 for every θ , then the Wald test is consistent, while C I (Δ) 1−α does not diverge provided that v(·) is bounded. Moreover, to mitigate the effects of the nuisance parameters, a possible way consists in adopting targets that depend only on the treatment difference Δ and not on θ B , namely ρ = ρ (Δ); however, this is only a partial solution, since the nuisance parameter affects any likelihood-based inferential procedure through the variance function. In this setting, assumptions A1-A2 become A: ρ is continuous and increasing with ρ (Δ) = 1 − ρ (−Δ).
is the asymptotic allocation of the doubly-adaptive weighted difference design, suggested by Geraldes et al. (2006). It is obtained by a suitable weighted combination of two linear randomization functions, one for ethics and the other dictated by balance, where ω ∈ [0; 1] reflects the relative importance of ethics. Note that ρ G guarantees the consistency of the Wald test and the reliability of the CIs, since as for every ω < 1. By combining these suggested solutions, even when the desired ρ is characterized by a strong ethical improvement, a possible way to overcome some degeneracies consists in re-scaling the target, namely by letting ρ r (Δ) = 1 − r + ρ (Δ)(2r − 1), with r ∈ (1/2; 1).
Although the anomalous scenarios induced by the unboundedness of the variance function-i.e., by the statistical model-cannot be overcome, by adopting ρ r some degeneracies caused by the target could be avoided, since the Wald test is consistent and C I (Δ) 1−α does not diverge.
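The re-scaling itself is a one-line affine map. The following sketch implements $\rho_r = 1 - r + \rho(2r - 1)$ with $r \in (1/2; 1)$ as given above; the function name `rescale_target` is hypothetical.

```python
def rescale_target(rho: float, r: float) -> float:
    """Re-scaled target rho_r = 1 - r + rho * (2r - 1), with r in (1/2, 1)."""
    if not 0.5 < r < 1.0:
        raise ValueError("r must lie in (1/2, 1)")
    return 1.0 - r + rho * (2.0 * r - 1.0)
```

The map sends $[0; 1]$ onto $[1 - r; r]$, leaves $\rho = 1/2$ unchanged and preserves the symmetry in A1, so the allocation to either arm never falls below $1 - r$. As a consequence, for fixed variances the asymptotic variance is at most $[v(\theta_A) + v(\theta_B)]/(1 - r)$.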

Remark 4
Since under condition C1 the treatment allocation proportion $\pi_n$ of a RAR design is a consistent estimator of the target, another possible way to overcome some drawbacks of likelihood-based asymptotic procedures consists in estimating $\sigma^2_\rho$ by $\tilde{\sigma}^2_n = \hat{v}_{An}/\pi_n + \hat{v}_{Bn}/[1 - \pi_n]$. Indeed, given a starting sample of $2n_0$ assignments, for any fixed $n$, $\pi_n \in [\eta_n; 1 - \eta_n]$, where $\eta_n = n_0/n \in (0; 1/2)$ is the percentage of (non-adaptive) allocations initially made on either treatment. In practice, $\pi_n \approx \rho(\hat{\boldsymbol{\theta}}_n)(1 - \eta_n) + [1 - \rho(\hat{\boldsymbol{\theta}}_n)]\eta_n$, which substantially corresponds to assuming a re-scaled target with $r = r(n) = 1 - \eta_n$. Unfortunately, this approach could be useful only for clinical trials where $\eta_n$ is non-negligible (i.e., for quite small samples); otherwise $n_0$ should be chosen as an increasing function of $n$ (Baldi Antognini et al. 2018).
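The correspondence with the re-scaled target can be checked in one line. The identity below is a short verification, assuming the re-scaled form $\rho_r = 1 - r + \rho(2r - 1)$ introduced earlier and writing $\rho$ for the estimated target.

```latex
\begin{align*}
\rho\,(1 - \eta_n) + (1 - \rho)\,\eta_n
  &= \eta_n + \rho\,(1 - 2\eta_n) \\
  &= 1 - r + \rho\,(2r - 1)
  \qquad \text{with } r = 1 - \eta_n .
\end{align*}
```

Since $\eta_n = n_0/n \to 0$ for fixed $n_0$, the implied $r(n)$ tends to 1 and the protection disappears in large samples, which is why the remark requires $n_0$ to grow with $n$.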
Although the re-scaling correction could also be applied to targets depending on nuisance parameters, in general it does not protect against the non-monotonicity of the power function discussed in Sect. 4. However, since $0 < (\rho_r)_{\theta_A} = (2r - 1)\rho_{\theta_A} < \rho_{\theta_A}$, the monotonicity condition (8) tends to be satisfied as $r$ decreases (namely when the target tends to be balanced); thus, as will be shown in Examples 3 and 4, this drawback could be strongly mitigated or overcome by re-scaling the target with a proper choice of $r$.

Example 3
To show how a re-scaled target not depending on the nuisance parameter could improve the precision of likelihood-based inference, we perform a simulation study in the same setting as Example 1 by adopting ρ N r with r = 0.9. Figure 4 shows the simulated distributions of $\hat{\Delta}_n$ as T and Δ vary, while Table 3 summarises the behaviour of the simulated 95% asymptotic confidence interval for Δ, where Lower (L) and Upper (U) bounds are obtained by averaging the endpoints of the simulated trials (within brackets the theoretical values derived by (4)).
Adopting ρ N r , the reliability of the C I (Δ) 0.95 drastically increases: analytical and simulated bounds almost coincide for every value of T and Δ. Although for small values of T (i.e., for a high ethical component) the width of the confidence intervals slightly grows, this does not compromise the inferential precision. By limiting the skewness and the variability of the MLE's distribution, the re-scaled target significantly improves the accuracy of the asymptotic confidence intervals, also confirmed by the empirical coverage which is always quite close to the nominal value. Note that the re-scaling correction seems also to reduce the bias of the MLEs, in particular for higher values of the treatment difference.
As regards hypothesis testing, Fig. 5 shows the power of the Wald test adopting ρ N r as T and r vary (the case r = 1 corresponds to ρ N ).
Regardless of the value of T, the re-scaled target (i.e., r < 1) always preserves the consistency of the test. However, this target does not satisfy condition (9) in general and, for small values of T, the decrease of the power is accentuated as r tends to 1. Even for T = 0.5 or T = 0.3, by selecting r ≤ 0.95 the monotonicity condition (9) is fulfilled; in this way the ethical component of the target could be strongly improved without compromising inference.
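The contrast between the original and the re-scaled normal target can be checked with the asymptotic approximation. The sketch below assumes $\rho_N = \Phi(\Delta/T)$, the re-scaling $\rho_r = 1 - r + \rho(2r-1)$, unit variance and $n = 250$ as in Example 1, and a one-sided test with $\alpha = 0.05$; it is not the simulation behind Fig. 5, and the function name is hypothetical.

```python
from math import erfc, sqrt
from statistics import NormalDist


def power_rho_n_rescaled(delta: float, T: float, r: float, n: int = 250,
                         v: float = 1.0, alpha: float = 0.05) -> float:
    """Approximate Wald power under the re-scaled normal target (r = 1: no re-scaling)."""
    q = 0.5 * erfc(delta / (T * sqrt(2.0)))        # 1 - rho_N = Phi(-delta/T)
    one_minus_rho_r = (1.0 - r) + q * (2.0 * r - 1.0)
    rho_r = 1.0 - one_minus_rho_r
    sigma = sqrt(v / rho_r + v / one_minus_rho_r)
    nd = NormalDist()
    return nd.cdf(sqrt(n) * delta / sigma - nd.inv_cdf(1.0 - alpha))
```

With $T = 0.3$ and no re-scaling, the approximation at $\Delta = 2$ is back near the significance level and lower than at $\Delta = 0.5$, whereas with $r = 0.9$ it is essentially 1 at $\Delta = 2$, since the share of the inferior arm can never drop below $1 - r$.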
Example 4
Ideally, the re-scaling correction should be applied to targets with a strong ethical skew (i.e., satisfying (6)) that (i) fulfill (8) to guarantee a monotonic power function of the Wald test and (ii) depend on the treatment effects only through the difference Δ (to mitigate the effects of the nuisance parameters). As previously shown, when adopting ρ PW none of these conditions is satisfied; however, the re-scaled version ρ PW r could still overcome or mitigate some of the above-mentioned drawbacks. To see this, we perform a simulation study in the same setting as Example 2, comparing the performances of ρ PW and ρ PW r with r = 0.9. Figure 6 shows the simulated power of the Wald test as Δ varies for θ B = 0.7, 0.8 and 0.9 and for n = 100, 250 and 400, while Table 4 summarizes the behaviour of the simulated 95% asymptotic confidence intervals (within brackets the theoretical values derived by (4)).
Compared to ρ PW (see Fig. 2), the re-scaled target ρ PW r guarantees the consistency of the Wald test and also strongly improves the behaviour of the power function. The improvement in the inferential precision is remarkable: for instance, with n = 100 and θ B = 0.9, for Δ = 0.08 the power is about 40%, with a gain of 13% with respect to the non re-scaled version, while for n = 250 the power increases by 18%. As regards CIs, although ρ PW performs quite well, the asymmetric distribution of the MLEs causes a right shift of the CI with a slight increase in the width (which is exacerbated for θ A > 0.95). On the other hand, the adoption of ρ PW r leads to narrower and centered CIs with a correct empirical coverage.
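The kind of comparison described in this example can be sketched by simulation. The code below is only an illustration under stated assumptions, not the procedure used for Fig. 6 or Table 4: it assumes a sequential maximum likelihood-type allocation (each patient is assigned to A with probability equal to the current estimate of the target), the affine re-scaling ρ_r = (2r − 1)ρ + (1 − r), shrunk success-rate estimates to keep the target defined early in the trial, and a two-sided Wald test at the 5% level. The names `pw_target` and `simulate_wald_power` are hypothetical.

```python
import numpy as np


def pw_target(theta_a: float, theta_b: float) -> float:
    """PW limiting proportion assigned to A: q_B / (q_A + q_B)."""
    q_a, q_b = 1.0 - theta_a, 1.0 - theta_b
    return q_b / (q_a + q_b)


def simulate_wald_power(theta_a, theta_b, n, r=1.0, n_trials=2000,
                        z_crit=1.96, seed=0):
    """Monte Carlo rejection rate of a two-sided Wald test for Delta = 0.

    r = 1 gives the non re-scaled PW target; r < 1 the (assumed) re-scaled one.
    """
    rng = np.random.default_rng(seed)
    rejections = 0
    for _ in range(n_trials):
        succ = np.zeros(2)   # successes on A and B
        count = np.zeros(2)  # patients assigned to A and B
        for _ in range(n):
            # Shrunk estimates (one extra success and failure per arm)
            # stay strictly inside (0, 1), so the target is always defined.
            est = (succ + 1.0) / (count + 2.0)
            rho = (2.0 * r - 1.0) * pw_target(est[0], est[1]) + (1.0 - r)
            arm = 0 if rng.random() < rho else 1
            theta = theta_a if arm == 0 else theta_b
            succ[arm] += rng.random() < theta
            count[arm] += 1
        if count.min() == 0:
            continue  # one arm empty: statistic undefined, no rejection
        mle = succ / count
        var = np.sum(mle * (1.0 - mle) / count)
        if var > 0 and abs(mle[0] - mle[1]) / np.sqrt(var) > z_crit:
            rejections += 1
    return rejections / n_trials
```

Calling `simulate_wald_power(0.98, 0.9, n=100, r=1.0)` and the same call with `r=0.9` gives a rough comparison of the two targets in one configuration; the exact figures depend on the allocation rule and on the conventions assumed here.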

Discussion
This paper explores in depth the limitations of the likelihood-based approach for RAR experiments, in terms of asymptotic confidence intervals and hypothesis testing.
Although clinical trials represent one of the most topical fields of application of this methodology (because of the main concern about the ethical impact on the subjects' care), RAR procedures could be a useful tool for local optimality problems also in different contexts like, e.g., industrial experiments. First of all, we show that some RAR rules as well as some targets can compromise the asymptotic likelihood-based inference, inducing a degenerating behaviour of the power of the Wald test and unreliable CIs. This is particularly true when the empirical evidence strongly suggests the superiority of one treatment with respect to the other or when the ethical component of the target is remarkable, since this could induce the target to approach either 0 or 1. Furthermore, these anomalies may also be caused by statistical models with unbounded variance, and inference could also be strongly compromised due to the effect of nuisance parameters.

Fig. 6 Simulated power of the Wald test adopting ρ PW r (with r = 0.9) as θ B and n vary

Our results show that, in general, ρ R is able to preserve the fundamental properties of hypothesis testing, because it guarantees the consistency of the Wald test as well as the monotonicity of its power; however, its dependence on the nuisance parameter could damage the inferential precision. On the other hand, the PW rule confirms its practical inadequacy since (i) the asymptotic CIs diverge and (ii) the power of the Wald test is decreasing and tends to the significance level as the difference between the treatment effects grows, thus severely undermining the inferential precision.
Inspired by the common practice of imposing a minimum percentage of allocations for each treatment, several authors have recently taken into account RAR procedures with a minimum prefixed threshold in the assignments to avoid possible degeneracies (see Tymofyeyev et al. 2007; Sverdlov et al. 2011; Sverdlov and Rosenberger 2013; Villar et al. 2015b). In this paper, we prove how a re-scaling correction of the target could preserve some of the fundamental properties of likelihood-based inference. In particular, we show that, by adopting a re-scaled target, the consistency of the Wald test and the reliability of the CIs are ensured (provided that the variance function is bounded), even with a high ethical component. Moreover, choosing a suitable threshold r significantly improves the accuracy of the asymptotic likelihood-based CIs (as also confirmed by the empirical coverage, which is quite close to the nominal value) and overcomes the non-monotonicity of the power function. Generally, a choice of r = 0.9 preserves the inferential accuracy, regardless of the statistical model and of the adopted target. As regards ρ N , r = 0.9 matched with T ≥ 0.5 guarantees good performances in terms of both ethics and inference. Clearly, these results could also be applied to the class of Bayesian RAR designs, where frequentist likelihood-based inference is performed at the end of the trial. Indeed, Bayesian RAR procedures could also present possible degeneracy in the treatment allocation proportions and therefore a re-scaling correction could represent a valid tool for inference. For instance, as recently discussed by Villar et al. (2018) for the case of several treatments, imposing a minimum percentage of allocation to the control group produces robust inference by preserving type-I errors even in the case of time trends.
However, in some circumstances, other critical issues related to the unboundedness of the variance function and the effect of the nuisance parameters cannot be circumvented by simply re-scaling the target. This is the case, for example, of ρ R and ρ Z under exponential and Poisson responses, respectively (namely, the corresponding Neyman allocations); their re-scaled versions, while maintaining the same inferential performances as the non re-scaled counterparts, protect against neither the strong dependence on the nuisance parameter nor the unboundedness of the variance function. In such situations, alternative inferential approaches could be preferable and one of the most promising is randomization-based inference (Wei 1988; Rosenberger 1993). Under this framework, the equality of treatment groups corresponds to an allocation in which the assignments are unrelated to the responses; inference is thus carried out by computing the distribution of the treatment allocations conditionally on the observed outcomes, which are treated as deterministic. Since the distribution of the test depends on the chosen RAR rule, exact results are quite few and, generally, p-values and the endpoints of confidence intervals are computed by Monte Carlo methods (for recent contributions see Wang et al. 2020 for randomization tests and Wang and Rosenberger 2020 for randomization-based interval estimation).
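A Monte Carlo randomization test of this kind can be sketched as follows. The outcomes are kept fixed in their order of arrival, the allocation sequence is regenerated many times with the RAR rule, and the p-value is the share of regenerated sequences whose test statistic is at least as extreme as the observed one. Everything specific in the sketch is an assumption made for illustration: the difference in means as test statistic, the two-sided comparison, and the names `randomization_p_value`, `alloc_prob` and `rescaled_pw_rule` (a hypothetical rule for binary responses based on a re-scaled PW target).

```python
import numpy as np


def diff_in_means(assign, y):
    """Mean response on A (assign == 1) minus mean response on B (assign == 0)."""
    n_a = assign.sum()
    if n_a == 0 or n_a == len(assign):
        return 0.0  # convention adopted here when one arm is empty
    return y[assign == 1].mean() - y[assign == 0].mean()


def rescaled_pw_rule(assign, y, r=0.9):
    """Hypothetical RAR rule: probability of assigning the next patient to A."""
    n_a = assign.sum()
    n_b = len(assign) - n_a
    # Shrunk success rates keep the target defined when an arm is empty.
    p_a = (y[assign == 1].sum() + 1.0) / (n_a + 2.0)
    p_b = (y[assign == 0].sum() + 1.0) / (n_b + 2.0)
    rho = (1.0 - p_b) / ((1.0 - p_a) + (1.0 - p_b))
    return (2.0 * r - 1.0) * rho + (1.0 - r)


def randomization_p_value(assign_obs, y, alloc_prob, n_mc=2000, seed=0):
    """Monte Carlo p-value with outcomes y held fixed and allocations re-drawn."""
    rng = np.random.default_rng(seed)
    assign_obs = np.asarray(assign_obs, dtype=int)
    y = np.asarray(y, dtype=float)
    s_obs = abs(diff_in_means(assign_obs, y))
    extreme = 0
    for _ in range(n_mc):
        sim = np.zeros(len(y), dtype=int)
        for i in range(len(y)):
            # The rule only sees past assignments and past (fixed) outcomes.
            sim[i] = rng.random() < alloc_prob(sim[:i], y[:i])
        extreme += int(abs(diff_in_means(sim, y)) >= s_obs)
    return extreme / n_mc
```

The rule is passed as a callable so that the same routine can be reused with any allocation procedure; a finite-sample correction such as (extreme + 1)/(n_mc + 1) is sometimes preferred and would be a one-line change.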
Our results are focussed on the case of two treatments, but a suitable extension to the multi-armed case could be very relevant. Indeed, for K > 2 treatments, multiple comparisons between the treatment groups should be taken into account for inference (some of them with possibly different importance, due to, e.g., previous knowledge about a gold standard or the presence of a control arm). As shown by Tymofyeyev et al. (2007), Sverdlov et al. (2011) and Baldi Antognini et al. (2019), the optimal design maximizing the power of the Wald test of homogeneity is a degenerate allocation involving only the best and the worst treatments without observations on the intermediate ones (here, the treatment order is the usual stochastic order between random variables). This clearly leads to unreliable inference about the treatment contrasts and, at the same time, problems also arise from the ethical viewpoint, since more than half of the patients could be assigned to the less effective treatment. A re-scaling transformation analogous to (10) can still be applied to a multidimensional target $\rho^t = (\rho_1, \ldots, \rho_K)$ with $\rho_i \geq 0$ and $\sum_{i=1}^{K} \rho_i = 1$, which ensures that $\rho_{ir} \in [(1-r)/(K-1);\, r]$ and $\sum_{i=1}^{K} \rho_{ir} = 1$. However, in this setting the impact of the re-scaling correction in terms of estimation efficiency and power needs to be studied. This topic, as well as proper comparisons between likelihood-based and randomization-based inference, is left for future research.
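As a hedged illustration of the multi-armed re-scaling, one affine map with the stated properties is ρ_ir = ((Kr − 1)ρ_i + (1 − r))/(K − 1), which sends ρ_i = 0 to (1 − r)/(K − 1) and ρ_i = 1 to r, preserves the unit sum, and reduces to (2r − 1)ρ + (1 − r) for K = 2. This specific form is an assumption chosen to match those constraints rather than a formula quoted from the paper, and the function name `rescale_multi` is hypothetical; the sketch also assumes 1/K ≤ r ≤ 1 so that the ordering of the components is preserved.

```python
def rescale_multi(rho, r):
    """Assumed affine re-scaling of a K-arm target into [(1 - r)/(K - 1), r]."""
    k = len(rho)
    if k < 2:
        raise ValueError("at least two treatments are required")
    if min(rho) < 0.0 or abs(sum(rho) - 1.0) > 1e-9:
        raise ValueError("rho must be non-negative and sum to 1")
    if not 1.0 / k <= r <= 1.0:
        raise ValueError("r must lie in [1/K, 1]")
    # Slope (Kr - 1)/(K - 1) and intercept (1 - r)/(K - 1).
    return [((k * r - 1.0) * p + (1.0 - r)) / (k - 1.0) for p in rho]
```

For instance, with K = 3 and r = 0.9 the degenerate target (1, 0, 0) is mapped to (0.9, 0.05, 0.05), so that every arm keeps a minimum allocation share.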