1 Introduction

Over the past decades, a growing stream of statistical papers on Response-Adaptive Randomization (RAR) has appeared, especially in the context of phase-III clinical trials for treatment comparisons, partly encouraged by U.S. government agencies and health authorities (CHMP 2007; FDA 2018). RAR procedures are sequential allocation rules in which the allocation probabilities change on the basis of earlier responses and past assignments; the aim is to balance the experimental goal of drawing correct inferential conclusions with the care of each patient's welfare, the so-called individual-versus-collective ethics dilemma (for recent reviews, see Hu and Rosenberger 2006; Atkinson and Biswas 2014; Baldi and Giovagnoli 2015; Rosenberger and Lachin 2015). A cornerstone example is the randomized Play-the-Winner (PW) rule suggested for binary trials (see, e.g., Wei and Durham 1978; Ivanova 2003). The peculiarity of the PW rule is that the allocation proportion of each of the two treatments converges to the relative risk of the other, so that (asymptotically) the majority of patients receive the better treatment. Another example, for normal and survival outcomes, is the treatment effect mapping (Rosenberger 1993), where the assignments are based on a function linking the difference between the treatment effects to the ethical skew of the allocation probabilities (Rosenberger and Seshaiyer 1997; Bandyopadhyay and Biswas 2001; Atkinson and Biswas 2005b).

Since the statistical objective of drawing correct inferential conclusions about the identification of the best treatment and its relative superiority often conflicts with the ethical aim of maximizing the subjects' care, some authors formalize these goals into suitable combined/constrained optimization problems (see, e.g., Rosenberger et al. 2001; Baldi Antognini and Giovagnoli 2010). The ensuing optimal allocations, usually referred to as targets, depend in general on the unknown treatment effects; although a priori unknown (the so-called local optimality problem), they can be approached by RAR procedures that sequentially estimate the model parameters in order to progressively converge to the chosen target. Classical examples are the Efficient Randomized Adaptive Design (ERADE) proposed by Hu et al. (2009) and the doubly-adaptive biased coin design (Hu and Zhang 2004). From a different perspective, the same trade-off between ethics and inference is a special case of the so-called exploration-versus-exploitation dilemma in the Bayesian literature on bandit problems, where at each step an agent wants to simultaneously acquire new knowledge and optimize his/her decisions based on existing information (see Villar et al. 2015a, b for a review).

Although the adaptation process induces a complex dependence structure, several authors provide conditions under which the classical asymptotic likelihood-based inference remains valid for RAR procedures (see, e.g., Durham et al. 1997; Melfi and Page 2000; Baldi Antognini and Giovagnoli 2005). Essentially, the crucial condition concerns the limiting allocation proportion induced by the chosen RAR rule, which should be a non-random quantity different from 0 and 1. Excluding some extremely ethical procedures, such as the randomly reinforced urn designs (May and Flournoy 2009), this condition is generally satisfied by the existing RAR rules and therefore the usual asymptotic properties of the MLEs are preserved; indeed, the large majority of the literature has focussed on asymptotic likelihood-based inference, where the Wald test is the cornerstone (Rosenberger and Sriram 1996; Rosenberger et al. 1997; Melfi et al. 2001; Hu and Zhang 2004; Atkinson and Biswas 2005a, b; Geraldes et al. 2006; Tymofyeyev et al. 2007; Azriel et al. 2012). Under RAR procedures, Yi and Li (2018) theoretically prove that the Wald statistic is first-order efficient, while Yi and Wang (2011) show via simulations that, although asymptotically equivalent to the likelihood ratio and score tests, it performs better in small samples. However, several simulation studies show that, in some circumstances, this approach presents anomalies in terms of coverage probabilities of confidence intervals, as well as inflated type-I errors (see, e.g., Rosenberger and Hu 1999; Yi and Wang 2011; Atkinson and Biswas 2014; Baldi Antognini et al. 2018), especially for targets with a strong ethical component.

The aim of this paper is to demonstrate the inadequacy of asymptotic likelihood-based inference for RAR procedures, in terms of both confidence intervals and hypothesis testing. We stress the crucial role played by the chosen target, the variance function of the statistical model and the presence of nuisance parameters, which could (i) compromise the quality of the Central Limit Theorem (CLT) approximation for the standard MLEs and (ii) lead to a vanishing Fisher information. In particular, these degeneracies could occur when the variance function is unbounded or when the target allocation approaches either 0 or 1 (which depends both on the chosen ethical component and on the relative superiority of a given treatment wrt the other); we also show how the functional form of the target could induce a non monotonic power function. We prove that the Wald test could become inconsistent, that it may display a strong dependence on the nuisance parameters, and that the standard confidence intervals could degenerate.

Since the common approach of practitioners consists in imposing a minimum percentage of allocations to each treatment, we demonstrate that re-scaling the target can circumvent some of these drawbacks. We show how a suitable choice of the threshold can be matched with a strong ethical skew of the target without compromising the inferential precision. Several illustrative examples are provided for normal, binary, Poisson and exponential data, and simulation studies are performed in order to confirm the relevance of our results.

The paper is structured as follows. Starting from the notation and some preliminaries in Sect. 2, Sect. 3 deals with likelihood-based inference, whose inadequacy for RAR procedures is highlighted in Sect. 4 through several examples showing the practical implications of the above-mentioned drawbacks. Section 5 discusses our proposal of re-scaling the target and its properties, and Sect. 6 deals with some concluding remarks.

2 Preliminaries

2.1 Notation and model

Suppose that statistical units come to the trial sequentially and are assigned to one of two competing treatments, say A and B. At each step \(i \ge 1\), let \(\delta _i\) be the indicator managing the allocation of the ith subject, namely \(\delta _i = 1\) if he/she is assigned to A and 0 otherwise. Given the treatment assignments, the observed outcomes Ys under either treatment are assumed to be independent and identically distributed, belonging to the natural exponential family with quadratic variance function \(Y\sim NQ(\theta ;v(\theta ))\), where \(\theta \in \varTheta \subseteq \mathbb {R}\) denotes the mean and the variance \(v=v(\theta ) > 0\) is at most a quadratic function of the mean (Morris 1982). In this setting, \(\varvec{\theta }=(\theta _A;\theta _B)^t\) denotes the treatment effects and from now on we let \(\overline{\varTheta } = \sup \varTheta \) and \(\underline{\varTheta } = \inf \varTheta \). Special cases of particular relevance for applications are the Bernoulli distribution (with \(\theta _j\in (0;1)\) and \(v(\theta _j)=\theta _j(1-\theta _j)\)) for binary outcomes, the Poisson model (\(\theta _j\in \mathbb {R}^+\) and \(v(\theta _j)=\theta _j\)) for count data and the exponential distribution (\(\theta _j\in \mathbb {R}^+\) and \(v(\theta _j)=\theta ^2_j\)) for survival outcomes, while the normal homoscedastic model is also encompassed for continuous responses (where \(\theta _j\in \mathbb {R}\) and \(v(\theta _j)=v \in \mathbb {R}^+\) is the common nuisance parameter). In this setting, the treatment outcomes are stochastically ordered on the basis of their effects and from now on (without loss of generality) we assume that higher responses are preferable. As is well known, the NQ class contains two further basic models, namely the negative binomial and the generalized hyperbolic secant distribution, which however may be less appealing for practical applications, especially in the clinical context.

After n steps, let \(N_{An}= \sum _{i = 1}^n \delta _i\) and \(N_{Bn}= n-N_{An}\) be the numbers of subjects assigned to A and B, respectively, so that \(\pi _n = n^{- 1} N_{An}\) is the allocation proportion to A (and \(1 - \pi _n\) to B). Then, the MLEs of the treatment effects coincide with the sample means, namely \(\hat{\theta }_{An} = N_{An}^{-1} \sum _{i = 1}^n \delta _i Y_i \) and \(\hat{\theta }_{Bn} = N_{Bn}^{-1} \sum _{i = 1}^n (1 - \delta _i) Y_i\), while the normalized Fisher information is \(\mathbf {M}_n =\text {{diag}}\left( \pi _n/v_A; [1-\pi _n]/v_B \right) \) (see Baldi and Giovagnoli 2015).
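
As a minimal illustration of this notation (our own sketch in Python, with made-up data and hypothetical helper names, not part of the original formulation):

    import numpy as np

    def summarize_trial(delta, y):
        """Compute pi_n and the per-arm MLEs (sample means) after n steps.

        delta : 0/1 array of treatment indicators (1 = treatment A)
        y     : array of observed responses, same length as delta
        """
        delta = np.asarray(delta, dtype=float)
        y = np.asarray(y, dtype=float)
        n = len(delta)
        n_A = delta.sum()                                   # N_An
        pi_n = n_A / n                                      # allocation proportion to A
        theta_A_hat = (delta * y).sum() / n_A               # sample mean on arm A
        theta_B_hat = ((1 - delta) * y).sum() / (n - n_A)   # sample mean on arm B
        return pi_n, theta_A_hat, theta_B_hat

    # toy usage with made-up data
    pi_n, tA, tB = summarize_trial([1, 0, 1, 1, 0], [1.2, 0.3, 0.9, 1.5, 0.1])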

2.2 Target allocations and RAR rules

Motivated by ethical demands, Response-Adaptive procedures have been proposed with the aim of skewing the assignments towards the treatment that appears to be superior or, more generally, of converging to suitable limiting allocation proportions—say \(\rho =\rho (\varvec{\theta })\in (0;1)\) to A (and \(1 - \rho \) to B, respectively)—namely ideal allocations of the treatments representing a valid trade-off between ethics and inference.

In the context of binary trials, a classical example is the PW rule (Zelen 1969), under which a success on a given treatment leads to assigning the same treatment to the next unit, while a failure implies switching to the competitor. Under this procedure, the allocation proportion of treatment A converges to

$$\begin{aligned} \rho _{PW}(\varvec{\theta }) = \frac{1-\theta _B}{2-\theta _A-\theta _B}, \end{aligned}$$
(1)

which is also the limiting allocation of the randomized PW (Wei and Durham 1978) and of the Drop-the-Loser rule (Ivanova 2003). By contrast, for normal homoscedastic trials Bandyopadhyay and Biswas (2001) and Atkinson and Biswas (2005b) suggested RAR procedures targeting

$$\begin{aligned} \rho _N(\varvec{\theta }) = \varPhi \left( \frac{\theta _A - \theta _B }{T}\right) , \end{aligned}$$
(2)

where \(\varPhi \) is the cumulative distribution function (cdf) of the standard normal and \(T > 0\) is a tuning parameter. Although \(\rho _{PW}\) and \(\rho _N\) are considered ethical targets, since the majority of subjects are assigned to the best treatment, they do not have a formal mathematical justification. On the other hand, by translating ethical aims and inferential goals into suitable design criteria, several authors provided optimal allocations via combined/constrained optimization problems. An example for binary trials is the target proposed by Rosenberger et al. (2001) and further generalized by Tymofyeyev et al. (2007), namely

$$\begin{aligned} \rho _Z(\varvec{\theta }) = \frac{\sqrt{\theta _A}}{\sqrt{\theta _A}+ \sqrt{\theta _B}}, \end{aligned}$$

which is aimed at minimizing the expected number of failures for a given variance of the estimated treatment difference, while

$$\begin{aligned} \rho _R(\varvec{\theta }) = \frac{\theta _A}{\theta _A+ \theta _B}, \end{aligned}$$

corresponds to the so-called A- and E-optimal design for exponential and Poisson data, respectively (Baldi and Giovagnoli 2015). Clearly, these targets also encompass normal homoscedastic data provided that the treatment effects are positive (Zhang and Rosenberger 2006).
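
Since these targets are simple closed-form functions of the treatment effects, they are straightforward to code; a minimal Python sketch (function names are ours) covering (1), (2) and the two optimal targets above:

    import numpy as np
    from scipy.stats import norm

    def rho_PW(theta_A, theta_B):
        # limiting allocation of the (randomized) play-the-winner rule, Eq. (1)
        return (1 - theta_B) / (2 - theta_A - theta_B)

    def rho_N(theta_A, theta_B, T=1.0):
        # Bandyopadhyay-Biswas type target for normal outcomes, Eq. (2)
        return norm.cdf((theta_A - theta_B) / T)

    def rho_Z(theta_A, theta_B):
        # Rosenberger et al. (2001) target for binary trials
        return np.sqrt(theta_A) / (np.sqrt(theta_A) + np.sqrt(theta_B))

    def rho_R(theta_A, theta_B):
        # theta_A / (theta_A + theta_B)
        return theta_A / (theta_A + theta_B)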

In order to favour the best treatment, the targets should depend on a suitable discrepancy measure between the unknown treatment effects (like, e.g., the treatment difference in \(\rho _N\), the ratio between the effects for \(\rho _R\) or the relative risk in \(\rho _{PW}\)), so that the target function \(\rho \) links the relative superiority of a given treatment to the ethical skewness of the allocations. Moreover, as in (2), the targets could also depend on a positive constant T—chosen by the experimenter—managing their ethical skew (i.e., for low values of T the target tends to strongly skew the assignments towards the best treatment, while as T grows the ethical component vanishes and \(\rho \) tends to balance the allocations). Therefore, common assumptions are:

A1: \(\rho \) is a continuous function invariant under label permutation of the treatments, namely \(\rho (\theta _A; \theta _B)=1-\rho (\theta _B; \theta _A)\),

A2: \(\rho \) is increasing in \(\theta _A\) and decreasing in \(\theta _B\),

ensuring that (i) both treatments are treated symmetrically and (ii) the best treatment is favoured increasingly as its relative superiority grows.

Remark 1

Note that, depending on the underlying statistical model, the well-known Neyman allocation \(\rho (\varvec{\theta })=\sqrt{v_A}/(\sqrt{v_A}+\sqrt{v_B})\) (i.e., the A-optimal design) may not have any ethical appeal, since the majority of patients could be assigned to the worst treatment. Indeed, for binary and normal outcomes it does not satisfy assumption A2, while for Poisson and exponential data the Neyman target is ethical and corresponds to \(\rho _Z\) and \(\rho _R\), respectively.

Given a desired \(\rho \), RAR rules based on sequential estimation can be employed to converge to it. After a starting sample of \(n_0\) subjects assigned to each treatment, used to derive non-trivial estimates of the unknown parameters, at each step \(n > 2n_0\) the treatment effects are estimated by \(\hat{\varvec{\theta }}_n = (\hat{\theta }_{An};\hat{\theta }_{Bn})^t\) and the target is estimated accordingly by \(\rho (\hat{\varvec{\theta }}_n )\), so that the allocation proportion is forced to converge to \(\rho \). For instance, ERADE (Hu et al. 2009) randomizes the allocations by

$$\begin{aligned} \Pr (\delta _{n + 1} = 1 \mid \delta _1,Y_1, \ldots , \delta _n,Y_n) = \left\{ \begin{array}{llll} \gamma \rho (\hat{\varvec{\theta }}_n ), &{} \text {if } \pi _n > \rho (\hat{\varvec{\theta }}_n ) \\ \rho (\hat{\varvec{\theta }}_n ), &{} \text {if } \pi _n = \rho (\hat{\varvec{\theta }}_n ),\\ 1 - \gamma \left[ 1 - \rho (\hat{\varvec{\theta }}_n )\right] , &{} \text {if } \pi _n <\rho (\hat{\varvec{\theta }}_n ) \end{array}\right. \end{aligned}$$

where \(\gamma \in [0;1)\) is the randomization parameter of the allocation process.
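
A sketch of a single ERADE assignment step following the displayed rule (Python; function and argument names are ours, and the estimated target \(\rho (\hat{\varvec{\theta }}_n )\) is assumed to be supplied by the caller):

    import numpy as np

    def erade_assignment(pi_n, rho_hat, gamma=0.5, rng=None):
        # One ERADE step: return 1 (assign A) or 0 (assign B) for the next subject,
        # given the current allocation proportion pi_n and the estimated target rho_hat.
        rng = rng if rng is not None else np.random.default_rng()
        if pi_n > rho_hat:
            p = gamma * rho_hat
        elif pi_n < rho_hat:
            p = 1 - gamma * (1 - rho_hat)
        else:
            p = rho_hat
        return int(rng.random() < p)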

3 Asymptotic likelihood-based inference for RAR procedures

Assuming that the inferential goal consists in estimating/testing the superiority of a given treatment with respect to the gold standard (say A wrt B), the parameter of interest is the treatment difference \(\varDelta =\theta _A-\theta _B\), while \(\theta _B\) is usually regarded as a nuisance parameter (namely, \(\theta _B\) is a common baseline while \(\varDelta \) represents the additive effect of the relative superiority/inferiority of A over B). Although the MLEs have the same form as in the non-sequential setting, the same is not true for their distribution, because of the complex dependence structure generated by the adaptation process. However, if the RAR design is chosen so that

C1: \(\lim _{n \rightarrow \infty }\pi _n = \rho (\varvec{\theta })\in (0;1) \quad a.s.\)

with \(\rho (\varvec{\theta })\) satisfying assumptions A1-A2, then the standard asymptotic inference remains valid. Indeed,

$$\lim _{n \rightarrow \infty }\mathbf {M}_n = \mathbf {M} =\text {{diag}}\left( \frac{\rho (\varvec{\theta })}{v_A}; \frac{1-\rho (\varvec{\theta })}{v_B}\right) \quad a.s.$$

and the MLEs are still consistent and asymptotically normal with \(\sqrt{n} (\hat{\varvec{\theta }}_n -\varvec{\theta }) \hookrightarrow \mathcal {N} (\varvec{0}_2, \mathbf {M}^{-1})\), where \(\varvec{0}_2\) is the 2-dim vector of zeros. Thus, letting \(\hat{\varDelta }_n = \hat{\theta }_{An} - \hat{\theta }_{Bn}\), we have \(\sqrt{n} (\hat{\varDelta }_n - \varDelta ) \hookrightarrow \mathcal {N} \left( 0,\sigma ^2_{\rho }\right) \), where

$$\begin{aligned} \sigma ^2_{\rho } = \frac{v_A}{\rho (\varvec{\theta })}+ \frac{v_B}{1-\rho (\varvec{\theta })} \end{aligned}$$
(3)

and, due to the continuity of the target, \(\lim _{n \rightarrow \infty }\rho (\hat{\varvec{\theta }}_n)=\rho (\varvec{\theta })\) a.s. Letting \(\hat{v}_{jn}\)s be consistent estimators of the treatment variances, then

$$ \hat{\sigma }^2_n=\frac{\hat{v}_{An}}{\rho (\hat{\varvec{\theta }}_n)}+\frac{\hat{v}_{Bn}}{1-\rho (\hat{\varvec{\theta }}_n)} $$

is a consistent estimator of \(\sigma ^2_{\rho }\) and the \(100(1-\alpha )\%\) asymptotic confidence interval is

$$\begin{aligned} CI(\varDelta )_{1-\alpha }= \left( \hat{\varDelta }_n \pm \frac{z_{1-\alpha /2}\hat{\sigma }_n}{\sqrt{n}}\right) , \end{aligned}$$
(4)

where \(z_{\alpha }\) is the \(\alpha \)-quantile of \(\varPhi \).
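
For concreteness, a minimal sketch (in Python, with hypothetical argument names) of the interval (4) under the plug-in variance estimator above:

    import numpy as np
    from scipy.stats import norm

    def wald_ci(theta_A_hat, theta_B_hat, v_A_hat, v_B_hat, rho_hat, n, alpha=0.05):
        """Asymptotic 100(1-alpha)% confidence interval (4) for Delta = theta_A - theta_B."""
        delta_hat = theta_A_hat - theta_B_hat
        sigma2_hat = v_A_hat / rho_hat + v_B_hat / (1 - rho_hat)   # plug-in version of (3)
        half_width = norm.ppf(1 - alpha / 2) * np.sqrt(sigma2_hat / n)
        return delta_hat - half_width, delta_hat + half_width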

As regards hypothesis testing, the inferential aim typically lies in testing \(H_0 : \varDelta = 0\) against \(H_1 : \varDelta > 0\) (or \(H_1 : \varDelta \ne 0\)). The asymptotic test is usually performed via the Wald statistic \(W_n=\sqrt{n}\hat{\varDelta }_n \hat{\sigma }_n^{-1}\) which, under \(H_0\), converges to the standard normal distribution. Thus, given the alternative \(H_1:\varDelta >0\), the power of the \(\alpha \)-level test is \(\Pr \left( \sqrt{n}(\hat{\varDelta }_n-\varDelta )> z_{1-\alpha } \hat{\sigma }_n -\sqrt{n}\varDelta \right) \), which can be approximated by

$$\begin{aligned} \varPhi \left( \sqrt{n}\varDelta \sigma _{\rho }^{-1} - z_{1 - \alpha } \right) , \quad \varDelta \ge 0, \end{aligned}$$
(5)

due to the consistency of \(\hat{\sigma }^2_n\). As stated by several authors (Lehmann 1999; Hu and Rosenberger 2006; Tymofyeyev et al. 2007), this approximation is accurate and particularly effective in the moderate-to-large sample setting of phase-III trials; it is therefore intended neither for early phase studies with small sample sizes, nor for the asymptotic regime, where different approaches aimed at providing a proper local approximation of the power around a specific value of \(\varDelta \) as \(n\rightarrow \infty \) (e.g., the local alternative framework) could be more suitable.

Although less relevant in actual practice, the two-sided alternative \(H_1 : \varDelta \ne 0\) can be handled analogously. Under \(H_0\), \(W^2_n\) converges in distribution to a central chi-square \(\chi ^2_1\) with 1 degree of freedom, while under \(H_1\) the distribution of \(W^2_n\) can be approximated by a non-central \(\chi ^2_1\) with non-centrality parameter \(n\varDelta ^2\sigma _{\rho }^{-2}\), namely the square of the crucial quantity in (5). As is well known, the power is an increasing function of the non-centrality parameter and it is maximized by the Neyman allocation, which also minimizes (3).
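
A sketch (ours, in Python with scipy) of the two power approximations just described, namely (5) and its non-central chi-square counterpart; the function names are hypothetical:

    import numpy as np
    from scipy.stats import norm, chi2, ncx2

    def approx_power_one_sided(delta, sigma_rho, n, alpha=0.05):
        # approximation (5) of the power of the alpha-level one-sided Wald test
        return norm.cdf(np.sqrt(n) * delta / sigma_rho - norm.ppf(1 - alpha))

    def approx_power_two_sided(delta, sigma_rho, n, alpha=0.05):
        # two-sided counterpart via the non-central chi-square approximation,
        # with non-centrality parameter n * Delta^2 / sigma_rho^2
        crit = chi2.ppf(1 - alpha, 1)
        return ncx2.sf(crit, 1, n * delta**2 / sigma_rho**2)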

4 Inadequacy of likelihood-based inference

Note that condition C1 avoids the extreme scenarios \(\rho =0\) or 1; however, most of the targets suggested in the literature satisfy the following property:

$$\begin{aligned} \lim _{\theta _A\rightarrow \overline{\varTheta }}\rho (\theta _A;\theta _B)=1 \quad \text {or} \quad \lim _{\theta _A\rightarrow \underline{\varTheta }}\rho (\theta _A;\theta _B)=0, \quad \text {for every} \; \theta _B. \end{aligned}$$
(6)

It is worth stressing that, even if the symmetry assumption A1 holds, \(\rho \rightarrow 1\) as \(\theta _A\rightarrow \overline{\varTheta }\) does not imply that \(\rho \rightarrow 0\) as \(\theta _A\rightarrow \underline{\varTheta }\), and vice-versa (see, e.g., \(\rho _{PW}\) in (1)).

If \(\rho \) satisfies (6) or if the variance function of the statistical model is unbounded, then the asymptotic variance \(\sigma _{\rho }^2\) tends to diverge and the quality of the CLT approximation could be damaged, thus compromising any likelihood-based inferential procedure. This translates into both (i) unreliable asymptotic confidence intervals and (ii) anomalous behaviour of the power of the Wald test.

4.1 Confidence Intervals

The following Theorem shows the drawbacks of the asymptotic likelihood-based confidence intervals, which could degenerate not only for statistical models with unbounded variance, but also when the chosen target is characterized by a strong ethical component, i.e., when \(\rho \) satisfies (6).

Theorem 1

The asymptotic variance \(\sigma _{\rho }^2\) and the width of the asymptotic \(CI(\varDelta )_{1-\alpha }\) diverge if the variance function is unbounded, i.e. when \(\overline{\varTheta } = \infty \) and \(\lim _{\theta \rightarrow \overline{\varTheta }} v(\theta ) =\infty \), or if \(\rho \) is chosen so that

$$\begin{aligned} \lim _{\theta _A\rightarrow \overline{\varTheta }}\rho (\theta _A;\theta _B)=1 \quad \text {or} \quad \lim _{\theta _A\rightarrow \underline{\varTheta }}\frac{v(\theta _A)}{\rho (\theta _A;\theta _B)}=\infty , \quad \text {for every} \, \theta _B\in \varTheta . \end{aligned}$$

In particular, for exponential and Poisson data, the width of \(CI(\varDelta )_{1-\alpha }\) diverges as \(\varDelta \) grows regardless of the chosen target, while for normal homoscedastic outcomes, the asymptotic CI degenerates for every target satisfying (6). As regards binary trials, \(CI(\varDelta )_{1-\alpha }\) degenerates under \(\rho _{PW}\), while it does not diverge adopting \(\rho _R\).

Proof

The proof follows directly from (3) by noticing that condition \(\lim _{\theta _A\rightarrow \underline{\varTheta }}\rho (\theta _A;\theta _B)=0\) for every \(\theta _B \in \varTheta \) is only necessary but not sufficient, since the variance function could vanish as \(\theta _A\rightarrow \underline{\varTheta }\). For normal homoscedastic, exponential and Poisson data the proof is straightforward. For binary trials, under the PW target, the asymptotic \(CI(\varDelta )_{1-\alpha }\) degenerates, since \(\lim _{\theta _A\rightarrow \overline{\varTheta }}\rho _{PW}(\theta _A;\theta _B)=1\) for every \(\theta _B\in (0;1)\) (although \(\lim _{\theta _A\rightarrow \underline{\varTheta }}\rho _{PW}(\theta _A;\theta _B)=(1-\theta _B)/(2-\theta _B)>0\)). Adopting \(\rho _R\) instead, \(CI(\varDelta )_{1-\alpha }\) does not diverge since, for every \(\theta _B\in (0;1)\), \(\lim _{\theta _A\rightarrow \overline{\varTheta }}\rho _{R}(\theta _A;\theta _B)=(1+\theta _B)^{-1}<1\) and \(\lim _{\theta _A\rightarrow \underline{\varTheta }}\rho _{R}(\theta _A;\theta _B)=0\), but \(\lim _{\theta _A\rightarrow \underline{\varTheta }} v(\theta _A)/\rho _R(\theta _A;\theta _B)= \theta _B<1\). \(\square \)

The divergence of the asymptotic CIs strongly depends on the speed of convergence of the target to 0 or 1. For instance, considering \(\rho _N\) in (2), this can be severely accentuated by the effect of the tuning constant, since T induces a scaling effect by contracting/expanding the treatment difference \(\varDelta \) (for \(T>1\) or \(T<1\), respectively). Thus, small choices of T may deteriorate the quality of the CLT approximation as well as accelerate the divergence of the asymptotic variance \(\sigma _{\rho }^2\), even for values of \(\theta _A\) close to \(\theta _B\) (i.e., for values of \(\varDelta \) close to 0) and not only as \(\theta _A\) tends either to \(\underline{\varTheta }\) or \(\overline{\varTheta }\).
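
As a rough numerical illustration of this scaling effect (our own sketch, for the normal homoscedastic model with \(v=1\), not part of the paper's simulations), one can evaluate (3) under \(\rho _N\):

    import numpy as np
    from scipy.stats import norm

    def sigma2_rho_N(delta, T, v=1.0):
        # asymptotic variance (3) for normal homoscedastic data under the target (2)
        rho = norm.cdf(delta / T)
        return v / rho + v / (1 - rho)

    for T in (1.0, 0.5, 0.3):
        print(T, [round(sigma2_rho_N(d, T), 1) for d in (0.0, 0.5, 1.0, 1.5)])
    # the variance grows very rapidly with Delta when T is small, since
    # rho_N(Delta) approaches 1 and the term 1/(1 - rho_N(Delta)) blows up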

Example 1

In order to stress how small values of T could severely undermine the precision of likelihood-based inferential procedures, we perform a simulation study with 100,000 normal homoscedastic trials (\(v =1\)) by employing ERADE (\(\gamma = 0.5\)) with \(n=250\). Considering \(\rho _N\), Fig. 1 shows the simulated distributions of the MLE \(\hat{\varDelta }_n\) as \(\varDelta \) and T vary, while Table 1 summarizes the behaviour of the simulated \(95\%\) asymptotic confidence intervals for \(\varDelta \), where the Lower (L) and Upper (U) bounds are obtained by averaging the endpoints of the simulated trials (within brackets, the corresponding theoretical values derived from (4)).

Fig. 1 Simulated distribution of \(\hat{\varDelta }_n\) adopting \(\rho _N\) as T and \(\varDelta \) vary

When \(\varDelta =0\), low values of T severely damage the CLT approximation, leading to a non-negligible increase of the density in the tails; whereas, for \(\varDelta >0\), the distribution of \(\hat{\varDelta }_n\) is positively skewed, regardless of the value of T.

For \(T\ge 1\), analytical and simulated confidence bounds are quite close; however, as \(\varDelta \) grows, the skewness affects the quality of the CLT approximation. Regardless of \(\varDelta \), small values of T severely damage the accuracy of the \(CI(\varDelta )_{0.95}\), whose width tends to diverge extremely fast. The empirical coverage confirms the above-mentioned behaviour and tends to 1 as the width of the intervals grows. Moreover, as shown by many authors (see, e.g., Coad and Woodroofe 1998), although asymptotically unbiased, the MLEs under RAR procedures are biased in finite samples. Even for \(n=250\), \(\hat{\varDelta }_n\) tends to overestimate \(\varDelta \) for positive values of the treatment difference, and this effect is exacerbated for low values of T.

4.2 Hypothesis Testing

Turning now to hypothesis testing, for every fixed value of the nuisance parameter \(\theta _B\in \varTheta \) (and \(v \in \mathbb {R}^+\) for normal homoscedastic data), the power function (5) is governed by the non-negative function

$$\begin{aligned} t_{\rho }(\varDelta )= \frac{\varDelta }{\sigma _{\rho }}=\frac{\theta _A - \theta _B}{\sqrt{\frac{ v(\theta _A)}{\rho (\theta _A;\theta _B)}+\frac{v(\theta _B)}{1-\rho (\theta _A;\theta _B)}}},\quad \theta _A - \theta _B \ge 0. \end{aligned}$$
(7)
Table 1 Likelihood-based simulated asymptotic \(CI(\varDelta )_{0.95}\) by adopting \(\rho _N\) as T and \(\varDelta \) vary

Notice that the Wald test could present inflated type-I errors. Indeed, when \(\theta _A=\theta _B\), from assumption A1, \(\rho (\varvec{\theta })=1-\rho (\varvec{\theta })=1/2\) and therefore \(t_{\rho }(0)= 0\) for every \(\theta _B\in \varTheta \), regardless of the chosen target. Moreover, since in this case \(\sigma _{\rho } = 2\sqrt{v(\theta _B)}\), inflated type-I errors could arise only if \(v(\theta _B)\simeq 0\). This is the reason why a slight inflation is detected in several simulation studies of both binary trials with low success probabilities and normal trials with small values of v (Zhang and Rosenberger 2012; Atkinson and Biswas 2014; Rosenberger and Lachin 2015).

Under the alternative hypothesis, the power could exhibit anomalous behaviour, especially when \(\rho \) has a strong ethical skew. In particular, we shall show that, for a given statistical model, some target allocations may induce a non monotonic power—which could also degenerate as the difference between the treatment effects grows—making the Wald test inconsistent. Indeed, for any significance level, if \(t_{\rho }(\varDelta )\) in (7) vanishes as \(\varDelta \) grows, then from (5) the power tends to \(\varPhi \left( - z_{1 - \alpha } \right) =\alpha \) (i.e., the significance level), as the following Theorem shows.

Theorem 2

When \(\overline{\varTheta }<\infty \), if \(\lim _{\theta _A\rightarrow \overline{\varTheta }}\rho (\theta _A;\theta _B)=1\) for every \(\theta _B\in \varTheta \), then the Wald test is not consistent. The same conclusion still holds when \(\overline{\varTheta }=\infty \), provided that \(\lim _{\theta _A\rightarrow \infty }(\theta _A-\theta _B)^2[1-\rho (\theta _A;\theta _B)]=0\) for every \(\theta _B\in \varTheta \). In particular, for binary trials the Wald test is consistent under \(\rho _R\), while it is not when adopting \(\rho _{PW}\). Considering the Poisson, exponential and normal homoscedastic models, \(\rho _R\) guarantees the consistency of the Wald test, while \(\rho _N\) induces the inconsistency of the test.

Proof

Given a chosen target \(\rho \), the Wald test is not consistent when \(t_{\rho }(\varDelta )\) in (7) vanishes as \(\varDelta \) grows. For \(\overline{\varTheta }<\infty \), from Theorem 1 this is satisfied iff \(\lim _{\theta _A\rightarrow \overline{\varTheta }}\rho (\theta _A;\theta _B)=1\) for every \(\theta _B\in \varTheta \). For \(\overline{\varTheta }=\infty \), the same conclusion still holds provided that as \(\theta _A \rightarrow \infty \), \(\sigma ^2_{\rho }\) diverges faster than \(\theta _A^2\). Since for the NQ class the variance function \(v(\cdot )\) is at most quadratic, this holds iff \(\lim _{\theta _A\rightarrow \infty }(\theta _A - \theta _B)^2\{v(\theta _B)/[1-\rho (\theta _A;\theta _B)]\}^{-1}=\lim _{\theta _A\rightarrow \infty }(\theta _A - \theta _B)^2[1-\rho (\theta _A;\theta _B)]=0\), for every \(\theta _B\in \varTheta \). For binary trials, assuming the PW target in (1) the power tends to \(\alpha \) as \(\varDelta \) grows, since \(\lim _{\theta _A\rightarrow \overline{\varTheta }} \rho _{PW}(\theta _A;\theta _B)=1\), for every \(\theta _B\in (0;1)\). Whereas, adopting \(\rho _R\), \(\lim _{\theta _A\rightarrow \overline{\varTheta }}\rho _{R}(\theta _A;\theta _B) =(1+\theta _B)^{-1}<1\) for every \(\theta _B \in (0;1)\) and therefore the test is consistent. Taking into account Poisson, exponential and normal homoscedastic models, adopting \(\rho _R\) the test is consistent since \(\lim _{\theta _A\rightarrow \infty }(\theta _A - \theta _B)^2[1-\rho _{R}(\theta _A;\theta _B)]=\theta _B (\theta _A - \theta _B)^2 (\theta _A+\theta _B)^{-1} =\infty \) for every \(\theta _B\in \mathbb {R}\) (even if \(\lim _{\theta _A\rightarrow \infty }\rho _{R}(\theta _A;\theta _B) =1\)). By using \(\rho _N\) the test is not consistent since \(\lim _{\theta _A\rightarrow \infty }(\theta _A-\theta _B)^2[1-\rho _{N}(\theta _A;\theta _B)]=0\) for every \(\theta _B\in \mathbb {R}\). \(\square \)
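
A quick numerical check of the theorem for binary trials, based on the power approximation (5); this is our own illustrative sketch, not part of the original simulations:

    import numpy as np
    from scipy.stats import norm

    def approx_power_binary(theta_A, theta_B, rho_fun, n=250, alpha=0.05):
        # approximate power (5) of the one-sided Wald test under a given binary target
        v_A, v_B = theta_A * (1 - theta_A), theta_B * (1 - theta_B)
        rho = rho_fun(theta_A, theta_B)
        sigma_rho = np.sqrt(v_A / rho + v_B / (1 - rho))
        return norm.cdf(np.sqrt(n) * (theta_A - theta_B) / sigma_rho - norm.ppf(1 - alpha))

    rho_PW = lambda a, b: (1 - b) / (2 - a - b)
    rho_R = lambda a, b: a / (a + b)

    theta_B = 0.8
    for theta_A in (0.85, 0.9, 0.95, 0.99, 0.999):
        print(theta_A,
              approx_power_binary(theta_A, theta_B, rho_PW),
              approx_power_binary(theta_A, theta_B, rho_R))
    # under rho_PW the approximate power eventually decreases towards alpha as
    # theta_A approaches 1, whereas under rho_R it keeps increasing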

Remark 2

Although condition \(\lim _{\theta _A\rightarrow \overline{\varTheta }}\rho (\theta _A;\theta _B)=1\) is always necessary for the inconsistency of the Wald test, for binary trials it is also sufficient, making the PW rule unsuitable for likelihood-based inference. Excluding the binary case, in order to reliably apply the Wald statistic, \(\rho \) should satisfy \(\lim _{\theta _A\rightarrow \infty }(\theta _A-\theta _B)^2[1-\rho (\theta _A;\theta _B)]>0\) for every \(\theta _B\in \varTheta \).

Remark 3

Although our approach complements the one of Yi and Li (2018), Theorems 1 and 2 clearly conflict with their results. In particular, the authors show that the Wald statistic achieves the upper bound of the asymptotic power and derive the rates of coverage error probability of the corresponding confidence intervals. Their results depend on the boundedness of the remainder term in the Taylor expansion of Lemma 1 in Yi and Li (2018), where the authors state that if \(\rho \in (0;1)\) then there exists \(r\in (0;1/2]\) such that \(r\le \rho \le 1-r\). However, this condition does not hold for targets satisfying (6) (for instance, \(\not \exists r \in (0;1/2]\) bounding \(\rho _N\)).

Example 2

To underline how the adoption of the PW target could severely undermine the reliability of the Wald test, we perform a simulation study with 100,000 binary trials by employing ERADE (\(\gamma = 0.5\)). Figure 2 shows the simulated power as \(\varDelta \) varies for \(\theta _B=0.7\), 0.8 and 0.9 for different sample sizes.

Fig. 2 Simulated power of the Wald test adopting \(\rho _{PW}\) as \(\theta _B\) and n vary

As theoretically proved, the power tends to the significance level \(\alpha \) regardless of the sample size. Moreover, the power function is decreasing not only for \(\theta _A \approx 1\) but also for smaller and potentially crucial differences between the treatment effects, especially in small samples. For instance, when \(n=100\) and \(\theta _B=0.9\), the maximum power is about 25%, attained at \(\varDelta =0.07\) (i.e., \(\theta _A=0.97\)), while for \(\theta _B=0.8\) the power is always lower than 75% and rapidly decreases for \(\varDelta \ge 0.16\). Even with \(n=250\), the power does not reach 1 when \(\theta _B > 0.8\); although such degenerating behaviour is attenuated as the sample size increases, it still persists for \(n=400\).

An additional drawback of the PW target is related to its functional form. Indeed, although condition A2 is satisfied (namely, \(\rho _{PW}\) is decreasing in \(\theta _B\) and therefore \(1-\rho _{PW}\) is increasing in \(\theta _B\)), for any fixed difference \(\varDelta =\theta _A-\theta _B\), the allocation to B is decreasing in \(\theta _B\) as the following table shows.

Table 2 The behaviour of the treatment allocation proportions adopting \(\rho _{PW}\) for \(\varDelta = 0.2\)

Indeed, the PW target could be rewritten as

$$ \rho _{PW}(\theta _A; \theta _B)=\frac{1- \theta _B}{2(1-\theta _B) - (\theta _A - \theta _B)} $$

and therefore, for a fixed difference \(\theta _A - \theta _B\), its derivative wrt \(\theta _B\) is

$$\frac{\theta _A - \theta _B}{[2(1-\theta _B)-(\theta _A - \theta _B)]^2}> 0, \quad \theta _A > \theta _B$$

leading to a negative derivative wrt \(\theta _B\) of \(1-\rho _{PW}\) (i.e., the target allocation of treatment B).
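
This behaviour can be reproduced numerically; the following sketch (ours, with a hypothetical helper rho_PW) mirrors the content of Table 2 for \(\varDelta = 0.2\):

    def rho_PW(theta_A, theta_B):
        return (1 - theta_B) / (2 - theta_A - theta_B)

    delta = 0.2
    for theta_B in (0.1, 0.3, 0.5, 0.7):
        theta_A = theta_B + delta
        print(theta_B, round(1 - rho_PW(theta_A, theta_B), 3))
    # for a fixed difference Delta = 0.2, the limiting allocation 1 - rho_PW
    # assigned to the inferior treatment B decreases as theta_B grows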

Besides consistency, an additional natural requirement of the test is that the power should be monotonically increasing in \(\varDelta \) (i.e., in \(\theta _A\) for every \(\theta _B \in \varTheta \)), in order to identify with high precision the best treatment as its relative superiority grows. From (7), provided that \(\rho \) is differentiable, the power of the Wald test is increasing iff, for every \(\theta _B\in \varTheta \),

$$\begin{aligned} \frac{2\sigma ^2_{\rho }}{\theta _A - \theta _B} \ge \frac{v'(\theta _A)}{\rho (\theta _A;\theta _B)} + \rho '_{\theta _{A}}(\theta _A;\theta _B)\left\{ \frac{v(\theta _B)}{[1 - \rho (\theta _A;\theta _B)]^2} - \frac{v(\theta _A)}{\rho ^2(\theta _A;\theta _B)}\right\} , \quad \theta _A > \theta _B \end{aligned}$$
(8)

where \(f'_{x}= \partial f/\partial x\) denotes the partial derivative of f wrt x (to avoid cumbersome notation, we shall omit the subscript for the derivative of scalar functions). Besides the statistical model, condition (8) involves the chosen target and needs to be satisfied for every \(\theta _A > \theta _B\), so it concerns the entire functional form of \(\rho \) (not only its limits and the speed of convergence to them, as in Theorems 1 and 2). Clearly, if the target induces the inconsistency of the test, then (8) fails to hold; conversely, a target \(\rho \) that guarantees the consistency of the test does not necessarily ensure the monotonicity of the power, as shown in Fig. 5. For instance, as also discussed by Baldi Antognini et al. (2018), for normal homoscedastic data \(v'=0\) and the power is increasing in \(\varDelta \) iff \(\rho \) is chosen so that, for every \(\theta _B\in \mathbb {R}\),

$$\begin{aligned} \rho (\theta _A;\theta _B)[1-\rho (\theta _A;\theta _B)] \ge (\theta _A - \theta _B)[\rho (\theta _A;\theta _B)-1/2]\rho '_{\theta _A}(\theta _A;\theta _B), \quad \theta _A>\theta _B. \end{aligned}$$
(9)

Clearly, this condition fails to hold for \(\rho _N\), while it is satisfied by \(\rho _R\). Analogously, for binary trials adopting \(\rho _{PW}\) the power of the Wald test is not monotonically increasing. Indeed, condition (8) can be restated as

$$\begin{aligned} \frac{2}{\theta _A - \theta _B} - \frac{\theta _A - \theta _B}{(2-\theta _A- \theta _B)(1-\theta _A)} \ge \frac{(1-2\theta _A)(1-\theta _A) + \frac{(\theta _A - \theta _B)(\theta _A +\theta _B-1)(1-\theta _B)}{2 - \theta _A-\theta _B}}{\theta _A(1-\theta _A)^2 + \theta _B(1-\theta _B)^2}, \end{aligned}$$

where, for every \(\theta _B \in (0;1)\), as \(\theta _A\) tends to \(\overline{\varTheta } = 1\) the LHS tends to \(-\infty \) while the RHS tends to \(1/(1-\theta _B)>0\).
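
A direct way to see these claims numerically is to evaluate \(t_\rho (\varDelta )\) in (7) for the homoscedastic normal model; this is our own sketch, assuming \(v=1\) and an arbitrary positive value of \(\theta _B\):

    import numpy as np
    from scipy.stats import norm

    def t_rho(theta_A, theta_B, rho, v=1.0):
        # the quantity (7) governing the approximate power, normal homoscedastic case
        return (theta_A - theta_B) / np.sqrt(v / rho + v / (1 - rho))

    theta_B = 1.0
    for d in (0.5, 1.0, 2.0, 3.0, 4.0):
        theta_A = theta_B + d
        rN = norm.cdf(d)                       # rho_N with T = 1
        rR = theta_A / (theta_A + theta_B)     # rho_R (positive effects)
        print(d, round(t_rho(theta_A, theta_B, rN), 3), round(t_rho(theta_A, theta_B, rR), 3))
    # t under rho_N first rises and then falls back towards 0 (non-monotone power),
    # while t under rho_R is increasing in Delta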

Proposition 1

For normal, binary, exponential and Poisson data, \(\rho _R\) always guarantees that the power of the Wald test is monotonically increasing in \(\varDelta \).

Proof

For the normal homoscedastic model, inequality (9) is trivially satisfied since \(2\theta _A(\theta _A+\theta _B)\ge (\theta _A - \theta _B)^2\) for every \(\theta _A\ge \theta _B>0\). For Poisson and exponential data, condition (8) still holds since, for every \(\theta _B \in \mathbb {R}^+\),

$$\begin{aligned} \frac{\theta _B}{\theta _A +\theta _B} \le 1 \le 1 + \frac{4\theta _A\theta _B}{\theta _A^2 -\theta _B^2}, \quad \theta _A\ge \theta _B>0. \end{aligned}$$

In the context of binary trials, inequality (8) becomes

$$\begin{aligned} \theta _A\{\theta _A - \theta _B + 2 \theta _B(2-\theta _A - \theta _B)\} \ge 0 \end{aligned}$$

which is clearly satisfied for \(1> \theta _A \ge \theta _B > 0\). \(\square \)

As previously discussed, \(\rho _R\) is able to preserve the fundamental properties of the Wald test, namely the consistency and the monotonicity of its power. However, this target strongly depends on the nuisance parameter \(\theta _B\); indeed, for a fixed difference \(\varDelta \), as \(\theta _B\) grows \(\rho _R(\theta _A;\theta _B)\rightarrow 1/2\) and, therefore, its ethical improvement tends to vanish, as does the induced power. For instance, from (7), under exponential outcomes \(t_{\rho _R}(\varDelta )=\varDelta /(\theta _A+\theta _B)\), while for Poisson data \(t_{\rho _R}(\varDelta )=\varDelta /\sqrt{2(\theta _A+\theta _B)}\), and both of them vanish as \(\theta _B\) grows, for any fixed \(\varDelta \). Figure 3 graphically confirms the crucial role played by \(\theta _B\) in terms of power: given a difference \(\varDelta =0.5\), under the exponential model the power decreases from 0.94 to 0.10 as \(\theta _B\) grows from 1 to 10 (while for Poisson data it goes from 0.97 to 0.34).
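
These expressions can be plugged into the power approximation (5); the following sketch (ours) reproduces the qualitative decay just described, and the resulting approximate values are close to the simulated ones reported for Fig. 3:

    import numpy as np
    from scipy.stats import norm

    def power_rho_R(theta_B, delta, model, n=250, alpha=0.05):
        # t_{rho_R}(Delta) under the exponential or Poisson model (see text), with
        # theta_A = theta_B + delta, plugged into the power approximation (5)
        if model == "exponential":
            t = delta / (2 * theta_B + delta)               # Delta / (theta_A + theta_B)
        elif model == "poisson":
            t = delta / np.sqrt(2 * (2 * theta_B + delta))
        else:
            raise ValueError("unknown model")
        return norm.cdf(np.sqrt(n) * t - norm.ppf(1 - alpha))

    for theta_B in (1, 2, 5, 10):
        print(theta_B,
              round(power_rho_R(theta_B, 0.5, "exponential"), 2),
              round(power_rho_R(theta_B, 0.5, "poisson"), 2))
    # for a fixed Delta = 0.5, the approximate power decays as theta_B grows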

Fig. 3 Simulated power of the Wald test for exponential and Poisson data adopting \(\rho _R\) with \(n=250\)

5 A possible solution for likelihood-based inference: the re-scaled target

From Theorems 1 and 2, it is quite evident that some anomalous behaviours could be prevented by assuming a target that is not characterized by a strong ethical component, namely under which (6) fails to hold. Indeed, if the target is chosen so that \(0<l_1\le \rho (\varvec{\theta }) \le l_2<1\) for every \(\varvec{\theta }\), then the Wald test is consistent, while \(CI(\varDelta )_{1-\alpha }\) does not diverge provided that \(v(\cdot )\) is bounded.

Moreover, to mitigate the effects of the nuisance parameters, a possible way consists in adopting targets that depend only on the treatment difference \(\varDelta \) and not on \(\theta _B\), namely \(\rho =\rho ^{\star }(\varDelta )\); however, this is only a partial solution, since the nuisance parameter affects any likelihood-based inferential procedure through the variance function. In this setting, assumptions A1-A2 become

A: \(\rho ^{\star }\) is continuous and increasing with \(\rho ^{\star }(\varDelta )=1-\rho ^{\star }(-\varDelta )\).

For instance, under normal, Poisson and exponential data, \(\rho ^{\star }\) can be interpreted as the cdf of a continuous r.v. with support in \(\mathbb {R}\) and symmetric around 0, as \(\rho _N=\rho ^{\star }_N\) in (2) (Baldi Antognini et al. 2018). For binary trials, instead, the target

$$\begin{aligned} \rho ^{\star }_G(\varDelta )=\frac{1}{2}+\frac{\omega \varDelta }{2(2-\omega )}, \quad \varDelta \in (-1;1), \end{aligned}$$

is the asymptotic allocation of the doubly-adaptive weighted difference design suggested by Geraldes et al. (2006). It is obtained by a suitable weighted combination of two linear randomization functions, one dictated by ethics and the other by balance, where \(\omega \in [0;1]\) reflects the relative importance of ethics. Note that \(\rho ^{\star }_G\) guarantees the consistency of the Wald test and the reliability of the CIs since, as \(\theta _A \rightarrow \overline{\varTheta } = 1\), \(\rho ^{\star }_G(\varDelta ) \rightarrow (2-\omega )^{-1}<1\), while as \(\theta _A \rightarrow \underline{\varTheta } = 0\), \(\rho ^{\star }_G(\varDelta )\rightarrow (1-\omega )/(2-\omega )>0\), for every \(\omega <1\).

By combining these suggested solutions, even when the desired \(\rho ^{\star }\) is characterized by a strong ethical improvement, a possible way to overcome some degeneracies consists in re-scaling the target, namely by letting

$$\begin{aligned} \rho ^{\star }_{r}(\varDelta ) = 1-r + \rho ^{\star }(\varDelta ) (2r-1), \quad \text {with} \quad r \in (1/2;1). \end{aligned}$$
(10)

Transformation (10) simply contracts the image of \(\rho ^{\star }\), which is re-scaled in \([1-r;r]\), while preserving its functional form. Clearly, for \(r=1\) no re-scaling transformation is applied, namely \(\rho ^{\star }_{1}(\varDelta )=\rho ^{\star }(\varDelta )\), while the case \(r=1/2\) corresponds to the balanced allocation.
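
The re-scaling transformation (10) is a one-line correction that can wrap any target; a sketch of ours (the wrapper name is hypothetical):

    from scipy.stats import norm

    def rescale_target(rho_star, r):
        """Return the re-scaled target (10), taking values in [1 - r, r]."""
        if not 0.5 < r <= 1:
            raise ValueError("r should lie in (1/2, 1]; r = 1 means no re-scaling")
        def rho_r(delta):
            return 1 - r + rho_star(delta) * (2 * r - 1)
        return rho_r

    # usage: re-scaled version of rho_N with T = 0.5 and r = 0.9
    rho_N_star = lambda delta, T=0.5: norm.cdf(delta / T)
    rho_N_r = rescale_target(rho_N_star, r=0.9)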

Fig. 4 Simulated distribution of \(\hat{\varDelta }_n\) adopting \(\rho _{N_r}^{\star }\) (\(r=0.9\)) as T and \(\varDelta \) vary

Although the anomalous scenarios induced by the unboundedness of the variance function—i.e., by the statistical model—cannot be overcome, by adopting \(\rho ^{\star }_{r}\) some degeneracies caused by the target could be avoided, since the Wald test is consistent and \(CI(\varDelta )_{1-\alpha }\) does not diverge.

Remark 4

Since under condition C1 the treatment allocation proportion \(\pi _n\) of a RAR design is a consistent estimator of the target, another possible way to overcome some drawbacks of likelihood-based asymptotic procedures consists in estimating \(\sigma _{\rho }^2\) by \(\breve{\sigma }^2_n=\hat{v}_{An}/\pi _n+\hat{v}_{Bn}/[1-\pi _n]\). Indeed, given a starting sample of \(2n_0\) assignments, for any fixed n, \(\pi _n\in [ \eta _n;1-\eta _n]\), where \(\eta _n=n_0/n\in (0;1/2)\) is the percentage of (non-adaptive) allocations initially made to either treatment. In practice, \(\pi _n \simeq \rho (\hat{\varvec{\theta }}_n )(1-\eta _n)+[1-\rho (\hat{\varvec{\theta }}_n )]\eta _n\), which essentially corresponds to assuming a re-scaled target with \(r=r(n)=1-\eta _n\). Unfortunately, this approach could be useful only for clinical trials where \(\eta _n\) is non-negligible (i.e., for quite small samples); otherwise \(n_0\) should be chosen as an increasing function of n (Baldi Antognini et al. 2018).

Although the re-scaling correction could also be applied to targets depending on nuisance parameters, in general it does not protect against the non monotonicity of the power function discussed in Sect. 4. However, since \(0< \rho '_{r_{\theta _A}}=(2r-1) \rho _{\theta _A}' < \rho _{\theta _A}'\), the monotonicity condition (8) tends to be satisfied as r decreases (namely, when the target tends to be balanced); thus, as will be shown in Examples 3 and 4, this drawback could be strongly mitigated or overcome by re-scaling the target with a proper choice of r.

Example 3

To show how a re-scaled target not depending on the nuisance parameter could improve the precision of likelihood-based inference, we perform a simulation study in the same setting as Example 1 by adopting \(\rho ^{\star }_{N_r}\) with \(r=0.9\). Figure 4 shows the simulated distributions of \(\hat{\varDelta }_n\) as T and \(\varDelta \) vary, while Table 3 summarises the behaviour of the simulated \(95\%\) asymptotic confidence interval for \(\varDelta \), where the Lower (L) and Upper (U) bounds are obtained by averaging the endpoints of the simulated trials (within brackets, the theoretical values derived from (4)).

Table 3 Likelihood-based simulated asymptotic \(CI(\varDelta )_{0.95}\) adopting \(\rho _{Nr}^{\star }\) (\(r = 0.9\)) as T and \(\varDelta \) vary

Adopting \(\rho _{N_r}^{\star }\), the reliability of the \(CI(\varDelta )_{0.95}\) drastically increases: analytical and simulated bounds almost coincide for every value of T and \(\varDelta \). Although for small values of T (i.e., for a high ethical component) the width of the confidence intervals slightly grows, this does not compromise the inferential precision. By limiting the skewness and the variability of the MLE's distribution, the re-scaled target significantly improves the accuracy of the asymptotic confidence intervals, as also confirmed by the empirical coverage, which is always quite close to the nominal value. Note that the re-scaling correction also seems to reduce the bias of the MLEs, in particular for higher values of the treatment difference.

As regards hypothesis testing, Fig. 5 shows the power of the Wald test adopting \(\rho _{N_r}^{\star }\) as T and r vary (the case \(r=1\) corresponds to \(\rho _{N}^{\star }\)).

Fig. 5 Power of the Wald test for normal outcomes adopting \(\rho _{N_r}^{\star }\) and \(\rho _{N}^{\star }\) as T and r vary

Regardless of the value of T, the re-scaled target (i.e., \(r<1\)) always preserves the consistency of the test. However, this target does not satisfy condition (9) and, for small values of T, the decreasing behaviour of the power is accentuated as r tends to 1. Even for \(T=0.5\) or \(T=0.3\), by selecting \(r\le 0.95\) the monotonicity condition (9) is fulfilled; in this way the ethical component of the target can be strongly enhanced without compromising inference.

Example 4

Ideally, the re-scaling correction should be applied to targets with a strong ethical skew—i.e., satisfying (6)—that (i) fulfill (8), to guarantee a monotonic power function of the Wald test, and (ii) depend on the treatment effects only through the difference \(\varDelta \), to mitigate the effects of the nuisance parameters. As previously shown, when adopting \(\rho _{PW}\) neither of these conditions is satisfied; however, the re-scaled version \(\rho _{PW_r}\) could still overcome or mitigate some of the above-mentioned drawbacks. To see this, we perform a simulation study in the same setting as Example 2, comparing the performances of \(\rho _{PW}\) and \(\rho _{PW_r}\) with \(r=0.9\). Figure 6 shows the simulated power of the Wald test as \(\varDelta \) varies for \(\theta _B=0.7\), 0.8 and 0.9 and \(n=100\), 250 and 400, while Table 4 summarizes the behaviour of the simulated \(95\%\) asymptotic confidence interval for \(\varDelta \), where the Lower (L) and Upper (U) bounds are obtained by averaging the endpoints of the simulated trials (within brackets, the theoretical values derived from (4)). If compared to \(\rho _{PW}\) (see Fig. 2), the re-scaled target \(\rho _{PW_r}\) guarantees the consistency of the Wald test, also strongly improving the behaviour of the power function. The improvement in the inferential precision is remarkable: for instance, with \(n = 100\) and \(\theta _B = 0.9\), for \(\varDelta = 0.08\) the power is about \(40\%\), with a gain of \(13\%\) wrt the non re-scaled version, while for \(n=250\) the power increases by \(18\%\). As regards CIs, although \(\rho _{PW}\) performs quite well, the asymmetric distribution of the MLEs causes a right shift of the CI with a slight increase in the width (which is exacerbated for \(\theta _A > 0.95\)). On the other hand, the adoption of \(\rho _{PW_r}\) leads to narrower and correctly centered CIs with an adequate empirical coverage.

Fig. 6 Simulated power of the Wald test adopting \(\rho _{PW_r}\) (with \(r=0.9\)) as \(\theta _B\) and n vary

Table 4 Likelihood-based simulated asymptotic \(CI(\varDelta )_{0.95}\) by adopting \(\rho _{PW}\) and \(\rho _{PW_r}\) (\(r = 0.9\)) with \(n=250\) and \(\theta _B = 0.7\), as \(\varDelta \) varies

6 Discussion

This paper explores in depth the limitations of the likelihood-based approach for RAR experiments, in terms of asymptotic confidence intervals and hypothesis testing. Although clinical trials represent one of the most prominent fields of application of this methodology (because of the central concern about the ethical impact on the subjects' care), RAR procedures could be a useful tool for local optimality problems in other contexts as well, such as industrial experiments. First of all, we show that some RAR rules, as well as some targets, can compromise the asymptotic likelihood-based inference, inducing a degenerating behaviour of the power of the Wald test and unreliable CIs. This is particularly true when the empirical evidence strongly suggests the superiority of one treatment wrt the other or when the ethical component of the target is remarkable, since this could induce the target to approach either 0 or 1. Furthermore, these anomalies may also be caused by statistical models with unbounded variance, and inference could also be strongly compromised by the effect of nuisance parameters.

Our results show that, in general, \(\rho _R\) is able to preserve the fundamental properties of hypothesis testing, because it guarantees the consistency of the Wald test as well as the monotonicity of its power; however, its dependence on the nuisance parameter could damage the inferential precision. On the other hand, the PW rule confirms its practical inadequacy, since (i) the asymptotic CIs diverge and (ii) the power of the Wald test is decreasing and tends to the significance level as the difference between the treatment effects grows, thus severely undermining the inferential precision.

Inspired by the common practice of imposing a minimum percentage of allocations for each treatment, several authors have recently considered RAR procedures with a minimum prefixed threshold on the assignments in order to avoid possible degeneracies (see Tymofyeyev et al. 2007; Sverdlov et al. 2011; Sverdlov and Rosenberger 2013; Villar et al. 2015b). In this paper, we prove how a re-scaling correction of the target could preserve some of the fundamental properties of likelihood-based inference. In particular, we show that, by adopting a re-scaled target, the consistency of the Wald test and the reliability of the CIs are ensured (provided that the variance function is bounded), even with a high ethical component. Moreover, choosing a suitable threshold r significantly improves the accuracy of the asymptotic likelihood-based CIs (as confirmed by the empirical coverage, which is quite close to the nominal value) and overcomes the non monotonicity of the power function. Generally, the choice \(r=0.9\) preserves the inferential accuracy, regardless of the statistical model and of the adopted target. As regards \(\rho _N\), \(r=0.9\) matched with \(T\ge 0.5\) guarantees good performance in terms of both ethics and inference. Clearly, these results could also be applied to the class of Bayesian RAR designs, where frequentist likelihood-based inference is performed at the end of the trial. Indeed, Bayesian RAR procedures could also present possible degeneracies in the treatment allocation proportions and therefore a re-scaling correction could represent a valid tool for inference. For instance, as recently discussed by Villar et al. (2018) for the case of several treatments, imposing a minimum percentage of allocation to the control group produces robust inference by preserving type-I errors even in the case of time trends.

However, in some circumstances, other critical issues related to the unboundedness of the variance function and the effect of the nuisance parameters cannot be circumvented by simply re-scaling the target. This is the case, for example, of \(\rho _{R}\) and \(\rho _{Z}\) under exponential and Poisson responses, respectively (namely, the corresponding Neyman allocations); their re-scaled versions, while maintaining the same inferential performance as the non re-scaled counterparts, protect against neither the strong dependence on the nuisance parameter nor the unboundedness of the variance function. In such situations, alternative inferential approaches could be preferable, and one of the most promising is randomization-based inference (Wei 1988; Rosenberger 1993). Under this framework, the equality of the treatment groups corresponds to an allocation in which the assignments are unrelated to the responses; inference is thus carried out by computing the distribution of the treatment allocations conditionally on the observed outcomes, which are treated as deterministic. Since the distribution of the test statistic depends on the chosen RAR rule, exact results are scarce and, generally, p-values and the endpoints of confidence intervals are computed by Monte Carlo methods (for recent contributions, see Wang et al. 2020 for randomization tests and Wang and Rosenberger 2020 for randomization-based interval estimation).

Our results are focussed on the case of two treatments, but a suitable extension to the multi-armed case could be very relevant. Indeed, for \(K>2\) treatments, multiple comparisons between the treatment groups should be taken into account for inference (some of them with possibly different importance, due to, e.g., previous knowledge about a gold standard or the presence of a control arm). As shown by Tymofyeyev et al. (2007), Sverdlov et al. (2011) and Baldi Antognini et al. (2019), the optimal design maximizing the power of the Wald test of homogeneity is a degenerate allocation involving only the best and the worst treatments, without observations on the intermediate ones (here, the treatment order is the usual stochastic order between random variables). This clearly leads to unreliable inference about the treatment contrasts and, at the same time, problems also arise from the ethical viewpoint, since more than half of the patients could be assigned to the less effective treatment. A re-scaling transformation can still be applied to a multidimensional target \(\varvec{\rho }^t=(\rho _1,\ldots , \rho _K)\) with \(\rho _i\ge 0\) and \(\sum _{i=1}^K\rho _i=1\) by letting, analogously to (10),

$$\rho _{ir}=(1-r)(1-\rho _i)/(K-1)+r\rho _i,\quad \text { for } i=1,\ldots ,K, \quad \text { with } r\in (1/K;1),$$

which ensures that \(\rho _{ir}\in [(1-r)/(K-1);r]\) and \(\sum _{i=1}^K\rho _{ir}=1\). However, in this setting the impact of the re-scaling correction in terms of estimation efficiency and power needs to be studied. This topic, as well as proper comparisons between likelihood-based and randomization-based inference, is left for future research.
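
For completeness, a small sketch (ours, with a hypothetical function name) of this multidimensional re-scaling, which reduces to (10) when \(K=2\):

    import numpy as np

    def rescale_multiarm(rho, r):
        """Re-scale a K-dimensional target rho (non-negative, summing to 1)."""
        rho = np.asarray(rho, dtype=float)
        K = len(rho)
        if not 1.0 / K < r <= 1:
            raise ValueError("r should lie in (1/K, 1]")
        return (1 - r) * (1 - rho) / (K - 1) + r * rho

    # components stay within [(1 - r)/(K - 1), r] and still sum to 1
    print(rescale_multiarm([0.7, 0.2, 0.1], r=0.8))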