Background

The primary objective of phase II trials in oncology is to identify the agents or treatments that are sufficiently efficacious in antitumor activity to warrant further investigation in phase III trials. The tumor response rate (RR) is a common primary endpoint used to indicate possible antitumor activity for a study treatment in phase II cancer clinical trials [1]. However, studies of a few novel agents in recent years have revealed that other endpoints, such as the time to progression (TTP) or progression-free survival (PFS), are also relevant in assessing the antitumor activity of various new agents [2–5]; this is because several of the studied agents have been reported to prolong the TTP or PFS instead of improving tumor RR [6, 7]. For example, despite a low tumor RR [8], agents such as sorafenib in renal cell carcinoma have been observed to have significant PFS and overall survival benefits [6]. Therefore, relying on a single traditional RR can lead to an unexpectedly high type II error, meaning that promising drugs are likely to be missed because of a lack of observed activity.

Phase II trials can also be used to rapidly terminate inefficacious drugs that do not warrant further development. The multistage design, typically a two-stage design, was developed to screen out inactive drugs at the interim stages. One advantage of this method is that it enables early termination of a futile study, so that patient resources can be conserved for other studies. Fleming [9] and Simon [10] proposed multistage designs in which tumor RR is the only endpoint for assessing the antitumor activity of drugs. Although phase II designs based on TTP or PFS endpoints have received considerable attention in the past decade [11–13], using the time-to-event (TTE) endpoint alone requires a longer period for assessing the outcome and therefore may not be an ideal screening tool for selecting active drugs and terminating inactive drugs [14]. In some clinical trials, a new therapeutic agent at the time of phase II development might have uncertain levels of drug activity, regardless of the extent to which it has been studied, and whether the TTE endpoint alone is the optimal choice is unclear. Therefore, combining both the response endpoint and the TTE endpoint for assessing new agents is a logical option. For example, in a phase II study of antisense AEG35156 in combination with sorafenib for advanced hepatocellular carcinoma (HCC), the original design used TTP alone as the primary endpoint because this antisense was expected to improve TTP more than tumor RR. However, the TTP benefit was not large enough to demonstrate drug activity, whereas tumor RR did demonstrate it. The median TTP was 4.0 months and 2.6 months for the study treatment and control arms, respectively. The primary TTP comparison favored the study treatment but did not reach statistical significance because the sample size was designed to be small in this phase II study. It was further pointed out that patients who had dose modifications according to protocol did significantly better in TTP than those who had no dose reduction, possibly due to potential side effects. Response status, as an outcome showing short-term drug activity, may also contribute; in this example, clear activity was observed in the study treatment arm and none in the control arm (5 versus 0 responses). If a mixed-endpoints design had been used in the first place, the drug would have been shown to be active in this study [15].

Zee et al. [16] and Sun et al. [17] proposed a multinomial design to accommodate both tumor response and progressive disease in evaluating the effectiveness of a study agent, in which the early progressive disease (EPD) rate was incorporated into the composite hypothesis setting. The additional information from the EPD endpoint enabled the multinomial design to provide a better decision rule than those based on the response endpoint alone, with a higher probability of early stopping and smaller expected sample size [18]. However, the lack of concordance between binary EPD at a fixed time point in phase II trials and TTE endpoints such as TTP, PFS, and overall survival (OS) in subsequent phase III studies suggests that EPD may not be the most appropriate endpoint for developing a multistage phase II design [5]. The discrete characteristic of the EPD endpoint may lose crucial information because of its arbitrary definition in the choice of a fixed time point in the evaluation [19].

In this study, we considered using the tumor RR and a TTE endpoint such as TTP or PFS, instead of the dichotomized EPD variable, for developing a stopping rule for multistage single-arm phase II trials. Because of the association between TTE and RR endpoints within the same patient, and because ignoring such an association can lead to higher type I or type II errors, we adopted the Gaussian copula method to model the dependence structure between a binary RR endpoint and a continuous TTE endpoint. If the tumor response probability is determined by a normal variable through the probit model and the underlying TTE is assumed to follow an exponential distribution, then the dependence between RR and TTE is expressed as a correlation between the underlying normal variable and the exponential variable. Under these conditions, our design allows early rejection of drugs if they have an unacceptably low RR after stage I and a short median TTE. The Methods section describes the multistage hypothesis-testing procedure based on the copula model. The Results section reports a simulation study conducted to assess the performance of the proposed design under various correlation settings.

Methods

In our phase II clinical trial design with tumor response and TTE endpoints, the null hypothesis and the alternative hypothesis are expressed as

$$ H_0:\ \left(p \le p_0\ \mathrm{and}\ T_{med}^{*} \le T_0\right) \quad \mathrm{versus} \quad H_1:\ \left(p > p_1\ \mathrm{or}\ T_{med}^{*} > T_1\right) \tag{1} $$

where $T_{med}^{*}$ is the median of the true TTE $T^{*}$, which is assumed to follow an exponential distribution with hazard rate λ, so that $T_{med}^{*} = \ln 2/\lambda$. Expecting a correlation between the tumor response endpoint and the TTE endpoint is logical because a high RR is typically related to a long TTE, particularly in studies with cytotoxic agents [20]. Therefore, in this design, the dependence between the probability of response and the hazard rate function for the TTE endpoint is modeled using a Gaussian copula (Appendix section A). We further assumed that the censoring time $T_i^{C}$ is noninformative (i.e., the marginal density function of tumor response and the true TTE and the dependence structure are not affected by censoring once the copula is prespecified in the design). In practice, the censored observations in most trials affect estimation of the true median TTE. Therefore, for the null hypothesis that the experimental treatment is inactive, the decision to reject the null hypothesis can be made based on the Kaplan–Meier median $T_{med}$ derived from the observed TTE $\min\{T_i^{*}, T_i^{C}\}$, $i = 1, \ldots, N$, which is consistent with the true median TTE in distribution [21], as well as on the total number of tumor responses $\sum_{i=1}^{N} Y_i$, where $Y_i$ is the tumor response indicator of the ith patient (Appendix). However, deriving the analytical form of the joint distribution of these two statistics by using the copula structure is not possible. Hence, we used a simulation-based approach (Appendix section B) to specify the critical values. Because noninformative censoring was assumed in the design, the censoring time was generated independently from an exponential distribution with hazard rate λC to obtain the observed time. To achieve the predetermined censoring rate rC for the TTE, the censoring hazard rate was set as λC = λrC/(1 − rC), which implies that the hazard rate λC used in generating early stopping rules for futility (H1) differs from that for activity (H0). For simplicity, we considered the censoring rates in both H0 and H1 to be the same, although the design allows distinct censoring for H0 and H1. Because the correlation coefficient ρ specified in the copula (Appendix section A) influences the decision boundary, the simulation-based method was applied under different correlation settings to obtain the appropriate decision criterion. We assumed a positive correlation because a high RR is likely to be associated with a long median TTE (TTP or PFS) when the treatment is expected to be active.
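The data-generating step described above can be sketched as follows. This is a minimal illustration and not the authors' R program: the exact copula parameterization in Appendix section A is not reproduced here, so the construction below (two correlated standard normals, a probit-type threshold for response, and an inverse-CDF transform for the exponential TTE) is an assumption, and the function names are hypothetical. Only the relations $T_{med}^{*} = \ln 2/\lambda$ and λC = λrC/(1 − rC) are taken directly from the text.

```python
import math
import random
from statistics import NormalDist

STD_NORMAL = NormalDist()  # standard normal, used for the probit threshold and the copula transform


def hazard_from_median(median_tte):
    """Exponential hazard implied by a median TTE: lambda = ln 2 / T_med."""
    return math.log(2) / median_tte


def censoring_hazard(event_hazard, censoring_rate):
    """Censoring hazard giving the target censoring rate: lambda_C = lambda * r_C / (1 - r_C)."""
    return event_hazard * censoring_rate / (1.0 - censoring_rate)


def simulate_patient(p, median_tte, rho, censoring_rate, rng):
    """Return (response, observed_time, event_observed) for one hypothetical patient.

    Assumed construction: 0 < p < 1 and -1 < rho < 1.
    """
    lam = hazard_from_median(median_tte)
    # Two standard normal variables with correlation rho (Gaussian copula layer).
    z1 = rng.gauss(0.0, 1.0)
    z2 = rho * z1 + math.sqrt(1.0 - rho ** 2) * rng.gauss(0.0, 1.0)
    # Probit-type threshold so that the marginal response probability equals p.
    response = 1 if z1 > STD_NORMAL.inv_cdf(1.0 - p) else 0
    # Inverse-CDF transform to an exponential TTE; a large z2 gives a long TTE.
    u = min(STD_NORMAL.cdf(z2), 1.0 - 1e-12)  # guard against log(0)
    true_tte = -math.log(1.0 - u) / lam
    # Independent (noninformative) exponential censoring time.
    lam_c = censoring_hazard(lam, censoring_rate)
    censor_time = rng.expovariate(lam_c) if lam_c > 0 else math.inf
    return response, min(true_tte, censor_time), true_tte <= censor_time
```

With a positive ρ, simulated responders tend to have longer TTEs than nonresponders, which is the dependence the design is meant to capture; the marginal response rate and the censoring fraction are unaffected by ρ under this construction.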

In many clinical trials, most investigators prefer to continue the study to improve the estimation accuracy at the interim analysis, even if the early rejection criteria of the null hypothesis are fulfilled. Therefore, we considered only early stopping for futility in the proposed design. We developed an R computer program for determining the phase II stopping criteria for the proposed method. Users need to specify the parameters for the hypotheses, the nominal type I and type II errors, and fixed sample sizes for stage I and the final stage of the study before a stopping criterion is generated.

Additional file 1: Tables S1a to S1d show the results of the two-stage stopping rules for the hypotheses, with a total sample size of 30 and 15 patients used in the first stage at α = 0.05 and 1 − β = 0.8 at a censoring rate rC = 0.1. The corresponding errors at the early stage (the first stage) are α1 = 0.01 and β1 = 0.1 according to the error-spending-function method. In the first example (Additional file 1: Table S1a), a low RR of p0 = 0.05 and a short median TTE of T0 = 3 were set for the null hypothesis, and p1 = 0.2 and T1 = 4.5 were set for the alternative hypothesis. At the first stage with 15 patients, when the correlation between tumor response and the underlying true TTE is high (e.g., ρ = 0.8), the null hypothesis would be accepted and the treatment would be rejected if we observed: (1) no response with a median TTE of 6.9 or lower; (2) one response with a median TTE of 4.1 or lower; (3) two responses with a median TTE of 3.0 or lower; or (4) three responses with a median TTE of 2.7 or lower. For the final stage, the treatment would be considered efficacious if we observed: (1) a median TTE of 4.7 or higher with any number of responses; (2) one response with a median TTE of 4.6 or higher; (3) two or more responses with a median TTE of 4.5 or higher; (4) three or more responses with a median TTE of 4.1 or higher; (5) four or more responses with a median TTE of 3.2 or higher; or (6) five or more responses with any median TTE.
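The decision rules quoted above for ρ = 0.8 can be written as simple lookups. The sketch below only encodes the boundary values stated in this paragraph (median TTE in months); the function and variable names are hypothetical, and boundaries for other correlations would have to be taken from Additional file 1.

```python
# Stage I futility boundary (15 patients, rho = 0.8), as quoted in the text:
# number of responses -> largest Kaplan-Meier median TTE that still stops the trial.
STAGE1_FUTILITY = {0: 6.9, 1: 4.1, 2: 3.0, 3: 2.7}

# Final-stage efficacy boundary (30 patients, rho = 0.8), as quoted in the text:
# number of responses (capped at 5) -> smallest median TTE that declares activity.
FINAL_EFFICACY = {0: 4.7, 1: 4.6, 2: 4.5, 3: 4.1, 4: 3.2, 5: 0.0}


def stop_for_futility(responses, median_tte):
    """True if the stage I result falls in the futility region (H0 accepted)."""
    limit = STAGE1_FUTILITY.get(responses)
    # With four or more responses there is no futility boundary in the quoted rule.
    return limit is not None and median_tte <= limit


def declare_active(responses, median_tte):
    """True if the final-stage result falls in the efficacy region (H0 rejected)."""
    return median_tte >= FINAL_EFFICACY[min(responses, 5)]
```

For instance, `stop_for_futility(0, 6.8)` is true under these values, whereas `stop_for_futility(1, 4.2)` is false and the trial would continue to the final stage.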

When implementing the proposed design for a specific trial, we suggest calculating the correlation based on RRs and TTEs from historical evidence. For example, for a phase II trial assessing the activity of bevacizumab, an angiogenesis inhibitor that slows the growth of new blood vessels, among patients with hepatocellular carcinoma (HCC), eight phase II trials on bevacizumab from 2006 to 2012 could be obtained using PubMed [22–29]. The results in Table 1 reveal that the Spearman correlation is estimated to be 0.90, which could be used as a correlation estimate for the proposed design. If historical results are unavailable, which could be the case for a relatively new drug, a search method could be applied to the interim data to estimate the most likely correlation. We first fix a set of candidate correlations and apply each in the copula setting to calculate the value of the likelihood function from the interim data [30]. The correlation that gives the highest likelihood value would be an appropriate choice for the design. When interim data are unavailable at the design stage, the decision rules can be generated under several conditions (e.g., correlations from 0.1 to 0.9, in 0.1 increments) to provide a correlation range for reference. We may choose the maximal number of patients conservatively, and subsequently apply the likelihood function estimation to determine the most appropriate stopping rules after interim data are obtained. The calculation program for either historical results or interim data will be made available on our website http://www2.ccrb.cuhk.edu.hk.
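As an illustration of the historical-evidence route, a Spearman rank correlation between trial-level RRs and median TTEs could be computed as sketched below. The function names are hypothetical, and the input values in the usage lines are invented placeholders for demonstration only; they are not the Table 1 data and do not reproduce the 0.90 estimate.

```python
import math


def average_ranks(values):
    """Ranks starting at 1, with tied values receiving the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman correlation: the Pearson correlation of the rank vectors."""
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(x)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx)
    vy = sum((b - my) ** 2 for b in ry)
    return cov / math.sqrt(vx * vy)


# Hypothetical placeholder inputs (NOT the Table 1 values).
example_rr = [0.05, 0.10, 0.15, 0.20, 0.25]
example_median_pfs = [2.0, 4.0, 3.0, 6.0, 5.0]
example_rho = spearman(example_rr, example_median_pfs)
```

The resulting estimate would then be supplied as ρ when generating the stopping rules; with only a handful of historical trials it is a rough guide, which is one reason the text also suggests tabulating rules over a range of correlations.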

Table 1 The tumor response and median PFS of phase II HCC trials on bevacizumab, 2006-2012

Results

Simulation study

The simulation study was conducted to evaluate the operating characteristics of the proposed design. For each scenario, 1,000 samples were generated from the copula model (Appendix section A) to test the stopping criterion specified by P0, T0, P1, T1, and ρ. With a total of 30 patients and 15 in the early stage, the results in Table 2 (Rows 1–12) indicate that the proposed stopping rules can generally achieve the desired type I error and power when high RRs and a high hazard ratio between the null and alternative hypotheses are expected (P0 = 0.2, T0 = 4, P1 = 0.4, T1 = 8 and P0 = 0.3, T0 = 4, P1 = 0.5, T1 = 8). The type I error in the low-RR, low-hazard-ratio design (P0 = 0.05, T0 = 3, P1 = 0.2, T1 = 4.5) is higher than the desired 5 % level, and the power is slightly lower than 80 % when the RR increases to P0 = 0.1, P1 = 0.3. When the sample size is increased to n = 40 (the last eight rows in Table 2), the type I error and power improve to acceptable levels. In other words, when we expect a low tumor RR in the study (e.g., a cytostatic trial), a sample size of 40 may be appropriate to achieve the desired type I error and power. If the expected RR is high (e.g., chemotherapy in multiple myeloma trials), 30 patients would be sufficient when using the proposed phase II design. The early stopping probability and expected sample size are also affected by the correlation between tumor response and TTE. The early stopping probability increased as the correlation decreased, and thus a smaller expected sample size was required with a lower correlation between the two endpoints in the trial. This is reasonable because the more independent the two endpoints are, the more information is available for statistical inference.
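Each simulated trial is summarized by its number of responses and its Kaplan–Meier median TTE. A minimal sketch of the median computation is given below; the function name is hypothetical, and the convention used (the smallest observed time at which the estimated survival is at or below 0.5) is a common one that is assumed here rather than taken from the Appendix.

```python
def km_median(times, events):
    """Kaplan-Meier median: smallest time at which estimated survival is <= 0.5.

    times  : observed times, min(true TTE, censoring time)
    events : truthy if the event was observed, falsy if censored
    Returns None when the survival estimate never reaches 0.5.
    """
    data = sorted(zip(times, events))
    at_risk = len(data)
    surv = 1.0
    i = 0
    while i < len(data):
        t = data[i][0]
        n_events = 0
        n_removed = 0
        # Group all observations sharing the same time.
        while i < len(data) and data[i][0] == t:
            n_events += 1 if data[i][1] else 0
            n_removed += 1
            i += 1
        if n_events:
            surv *= 1.0 - n_events / at_risk
            if surv <= 0.5 + 1e-12:  # small tolerance for floating-point error
                return t
        at_risk -= n_removed
    return None
```

Repeating this over many simulated trials under H0 and H1, and applying the stopping rules to each (responses, median) pair, gives empirical estimates of the type I error, power, and early stopping probability.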

Table 2 Simulation results of two-stage design for testing H0: P ≤ P0 & T*med ≤ T0 vs. H1: P > P1 or T*med > T1 at the nominal level α = 0.05 and 1-β = 0.80a

From the decision rules in Additional file 1: Tables S1a to S1d, a study agent could be declared active with a sufficiently high RR but an extremely short median TTE, or with a sufficiently long median TTE but a small number of responses. Conversely, a treatment with a relatively high median TTE or RR could be declared inactive; e.g., a trial with no response and a median TTE of 6.8 months at the first stage could be stopped early (first block of Table S1a), despite 6.8 months being 2.3 months in excess of the value defined by the alternative hypothesis. We considered these clinically ambiguous situations in the simulation for assessing the performance of the generated decision rules. The overall type I (an inactive drug incorrectly claimed as active) and type II errors (an active drug incorrectly claimed as inactive) were still under the desired levels (α < 0.05, β < 0.2), suggesting that the error of claiming active or inactive drugs with contradictory results is small.

In generating the decision rules, we employed the exponential distribution assumption for the TTE outcome. To assess the sensitivity of the results to this assumption, we performed another simulation in which we assumed a more general Weibull distribution with varying shape parameter k. The Weibull distribution is equivalent to the exponential distribution when k = 1, implying a constant hazard during the study. The hazard decreases with time if k < 1 and increases with time if k > 1. The results in Table 3 indicate that the type I and type II errors increase if the decision rules based on the constant hazard assumption are applied when the hazard actually decreases over time. When the magnitude of hazard decrease is large (k = 0.5) and the total sample size is 40, the type I and type II errors deteriorate to unacceptable levels (α > 0.1 and power < 0.75). The type II error problem can be rectified by increasing the sample size in the severe hazard decrease situation (k = 0.5). However, this strategy has little effect on reducing type I error. When increased hazard is observed over time, the decision rules lead to fewer type I errors and power. In summary, only a decreased hazard situation will worsen the performance of the proposed design based on the exponential distribution assumption for the TTE outcome.
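To make the role of the shape parameter concrete, the sketch below parameterizes a Weibull TTE by its median and k, so that scenarios with different k share the same median. This parameterization, with survival function S(t) = exp(−(t/b)^k), and the function names are assumptions for illustration; the source does not state how the Weibull scale was chosen in the sensitivity analysis.

```python
import math
import random


def weibull_scale_for_median(median_tte, k):
    """Scale b such that S(t) = exp(-(t/b)^k) equals 0.5 at t = median_tte."""
    return median_tte / math.log(2) ** (1.0 / k)


def weibull_hazard(t, median_tte, k):
    """Hazard h(t) = (k/b) * (t/b)^(k-1); constant when k = 1."""
    b = weibull_scale_for_median(median_tte, k)
    return (k / b) * (t / b) ** (k - 1)


def sample_weibull_tte(median_tte, k, rng):
    """Draw one TTE by inverting the Weibull CDF."""
    b = weibull_scale_for_median(median_tte, k)
    u = rng.random()  # uniform on [0, 1)
    return b * (-math.log(1.0 - u)) ** (1.0 / k)
```

With k = 1 the hazard reduces to ln 2 divided by the median at every time point, which recovers the exponential model used to generate the decision rules.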

Table 3 Sensitivity analysis with Weibull distribution assumed for TTE

We also used the simulation study to compare the proposed design with the Zee et al. multinomial design, based on two binary endpoints, as well as with Simon’s optimal design, based on a single tumor-response endpoint, with approximately the same sample size and hazard ratio of the event of interest, such as progression. For example, the scenario with an early progression rate of 0.6 in H0 and of 0.4 in H1 was considered to have a hazard ratio of 1.5 in the progression, corresponding to the situation of T0 = 3 and T1 = 4.5 in Table 2. The results in Table 4 indicate that the probability of stopping the study early under the proposed design is similar to that of the multinomial design and Simon’s optimal design when the tumor RRs for the null and alternative hypotheses are P0 = 0.1 versus P1 = 0.3. With a lower (P0 = 0.05 and P1 = 0.2) or higher tumor RR (P0 = 0.3 and P1 = 0.5), the proposed design showed a higher early stopping probability than both the multinomial design and Simon’s optimal design.

Table 4 Comparison with Simon’s optimal design and Zee’s multinomial design

We also compared the proposed design with those based on a single TTE endpoint [11, 12]. For example, we compared a null hypothesis with an RR of 0.1 and a median TTE of 3 versus an alternative hypothesis with an RR of 0.3 and a median TTE of 5, using 34 patients in the first stage and 68 patients in the final stage. The expected sample size of the approximate survival method [12] using the TTE alone was 51, with a type I error equal to 0.097 and a power of 0.850, whereas our proposed method with mixed RR and TTE endpoints yields a smaller expected sample size of 34.6, a type I error of 0.017, and a power of 0.881. For the same hypothesis using 52 patients in the first stage and 81 patients in the final stage, the expected sample size based on the TTE endpoint alone [11] was 63.5 with a type I error of 0.122 and a power of 0.935, and our proposed method based on mixed RR and TTE endpoints yields a smaller expected sample size of 52.6 in the presence of a type I error of 0.019 and a power of 0.979. The results showed that the mixed-endpoints design has a smaller expected sample size and higher early stopping probability according to H0, indicating that the proposed design is more likely to stop an inactive agent than those based on a single TTE endpoint.
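The link between the expected sample sizes and the early stopping probabilities reported above can be sketched with the usual two-stage relation EN = n1 + (1 − PET)(n − n1), where PET is the probability of early termination under H0. That the reported values were computed with exactly this relation is an assumption, and the function names are hypothetical.

```python
def expected_sample_size(n_stage1, n_total, prob_early_stop):
    """EN = n1 + (1 - PET) * (n - n1) for a two-stage design with futility stopping."""
    return n_stage1 + (1.0 - prob_early_stop) * (n_total - n_stage1)


def implied_early_stop_probability(n_stage1, n_total, expected_n):
    """Invert the relation above to recover PET from a reported expected sample size."""
    return 1.0 - (expected_n - n_stage1) / (n_total - n_stage1)


# Reported values for the mixed-endpoints design in the two comparisons above.
pet_34_68 = implied_early_stop_probability(34, 68, 34.6)
pet_52_81 = implied_early_stop_probability(52, 81, 52.6)
```

Under this relation, an expected sample size only slightly above the stage I size corresponds to an early stopping probability close to 1 under H0.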

Discussion of application

Numerous multistage designs have recently been developed for phase II clinical trials. However, these designs are based either on a tumor-response endpoint alone or on a TTE endpoint alone (e.g., progression-free survival). This may not be the optimal strategy for evaluating the efficacy of study treatments because a natural correlation could exist between tumor response and the TTE [20], and because abandoning either endpoint may cause severe information loss. Although Zee et al. [16] and Sun et al. [17] proposed multinomial designs to incorporate tumor response and EPD information, the dichotomous EPD endpoint does not extract the information from a study as fully as the TTE endpoint does. We generalized these designs to integrate the tumor response endpoint and the TTE endpoint, to fully use the information in the study and to generate efficient stopping rules. We assumed a Gaussian copula to describe the dependence structure between binary tumor response and continuous TTE, which is similar to the binomial-exponential setting used by de Leon and Wu [30]. Because no analytic solution exists, we employed a simulation-based method to generate the stopping rules for tumor response and median TTE under various fixed correlations between endpoints. The results of a two-stage design showed that the correlation has an effect on the stopping boundary at the final stage and on the decision criterion for early stopping in the first stage. As the correlation decreases, the boundary for early stopping also decreases, meaning that a shorter median TTE or smaller RR is required to stop the study at the interim analysis. When we apply the proposed design in practice, the correlation between RR and TTE can be estimated from the results of previous studies. When historical data are unavailable, particularly for a new drug, the preliminary study data can be used to estimate the correlation based on the copula likelihood function of the interim data. To make this method more applicable, we developed an R program for calculating the correlation.

The interim assessment can be planned at the time when half of the patients achieve progression, or the time corresponding to each response in the early stopping rules, whichever is shorter. For example, if H0: P ≤0.05 and TTP*med ≤3 versus H1: P >0.2 or TTP*med >4.5 and correlation ρ is set as 0.2 to approximate the correlation in the Zee multinomial design, the interim assessments can be conducted at the time when half of the patients develop progression or 4.5 months, whichever is shorter, when one response is observed. Compared with the Zee multinomial design, one advantage of the proposed design is that the interim assessment does not rely on the progression time and hence we would be able to decide earlier, which is particularly useful for trials with long time to progression (e.g., prostate cancer trials). Compared with those designs based on a single TTE, the proposed design has the ability to reduce the potential long waiting time for assessing the TTE outcome if response increases.

Similar to the two-stage design based on the TTE endpoint, the proposed design also has the limitation of requiring a longer wait period than tumor response to obtain the estimated median TTE [14]. However, the proposed design with the tumor-response endpoint could mitigate this limitation to a certain extent. For example, if a phase II trial is conducted using the sample size and hypotheses in the last block of Table S1a (correlation = 0), the clinician must wait 6.6 months or until the occurrence of eight events (median of 15 patients), whichever is shorter, when no response is observed at the first stage. If the eighth event occurs before 6.6 months without response, the trial can be stopped for futility; if fewer than eight events have occurred at 6.6 months, the study can be continued to stage II. In the event of one response, the waiting period can be reduced to 4.9 months or until eight events have been observed. If the required period for the TTE endpoint evaluation is not excessively long in some advanced cancer studies, the proposed design has the potential to accelerate the inference, which may improve the efficiency of phase II clinical trials. Therefore, the introduction of tumor response in the proposed design could reduce the potential waiting time compared with designs based on a single TTE endpoint.
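The waiting rule in this example can be sketched as follows. The boundaries of 6.6 and 4.9 months are the values quoted above for the correlation = 0 block; the function name is hypothetical, and the sketch assumes, as a simplification, that no patient is censored before the interim look, so that the median of 15 patients is reached at the eighth event.

```python
import math

# Stage I boundaries (months) quoted in the text for correlation = 0:
# number of responses -> longest median TTE that still stops the trial.
WAIT_BOUNDARY = {0: 6.6, 1: 4.9}


def interim_futility_check(event_times, n_stage1, tte_boundary):
    """Return (decision, time at which the decision can be made).

    event_times  : times to progression observed so far (months)
    n_stage1     : number of stage I patients (15 in the example)
    tte_boundary : futility boundary for the observed number of responses
    """
    events_needed = math.ceil(n_stage1 / 2)  # 8 events for 15 patients
    early = sorted(t for t in event_times if t <= tte_boundary)
    if len(early) >= events_needed:
        # The median is reached at or before the boundary: stop for futility.
        return "stop for futility", early[events_needed - 1]
    # Too few events by the boundary time: the median must exceed the boundary.
    return "continue to stage II", tte_boundary
```

For a stage I cohort with no response, the call would use `WAIT_BOUNDARY[0]`; with one response, `WAIT_BOUNDARY[1]` shortens the longest possible wait from 6.6 to 4.9 months.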

In our proposed design, we only allow early stopping for futility, because investigators commonly choose to continue the study in practice, even if early rejection criteria of the null hypothesis are fulfilled. Considering that the study would not be stopped early when either of the endpoints is extremely promising, we used the early rejection rules to adaptively bind the early acceptance rules to obtain reasonable stopping rules for futility. With 30 patients, the simulation results indicate that the proposed stopping rules can generally achieve the desired type I error of 5 % and power of 80 % when high RRs and a high hazard ratio between null and alternative hypotheses are expected. The type I error in the low RR and low hazard ratio design is higher than the desired 5 % level. If the sample size increases to 40, the desired type I and type II errors can be maintained in various situations. This implies that a sample size of 30 could be sufficient to achieve the desired type I and type II error levels for trials with a high expected RR and a high hazard ratio; if a low RR and low hazard ratio are expected, a slightly larger sample size of 40 may be adequate. If early stopping for activity is allowed, the boundary of early rejection may overlap with the boundary of early acceptance. This means that the conclusion of the effectiveness of the study treatment may be contradictory as being both efficacious and inactive in the first stage, because of the flexibility (or characteristics) of two diverse quadrants of parameter space in the response and the hazard of the TTE. In this case, the adaptive approach based on early rejection rules may be useful for eliminating the overlap.

The proposed design assumes that both the response endpoint and the TTE endpoint indicate study agent activity, which may be true in a targeted drug setting with unknown clinical activity. In the case of a cytostatic drug, possible tumor shrinkage or response may still indicate drug activity, although inhibited tumor growth is primarily targeted and the TTE endpoint is commonly adopted in the assessment. Thus, the null hypothesis can be rejected and the drug can be accepted if either of these endpoints exceeds the required level, despite another endpoint indicating futility. Tumor response is typically positively correlated with the TTE, and an extremely high RR (e.g., >60 %) coinciding with an extremely small median TTE (e.g., one month) is unlikely. In the event of this extreme situation occurring, further investigation of the study agent should be conducted to determine the mechanism behind the unexpected observation.

Although an exponential distribution is commonly assumed for patient survival [31], the constant hazard that it implies may be incorrect in practical trials. The sensitivity analysis indicated that the proposed design based on the exponential assumption for TTE is applicable in practical phase II cancer trials where the hazard increases with time. However, if a decreasing hazard is observed in the study, the chance of incorrectly concluding that an inactive drug is active (type I error) or that an active drug is inactive (type II error) may exceed the expected level. Furthermore, the assessment period, typically scheduled cyclically, may also affect the estimate of the median TTE [32]. The simulation results of Panageas et al. [32] showed that the commonly used upper-limit progression time (where the progression date is defined as the date at which progression is first detected during the assessment cycle) could overestimate the true median TTP or PFS, thus affecting the statistical inference only at the final stage rather than at the early stage. This is because only early acceptance of the null hypothesis is allowed in the proposed design, and the true median TTE, which is shorter than the estimated TTE, still fulfills the early stopping criteria. Following the recommendation of Panageas et al. [32], the upper limit and lower limit (where the progression date is defined as the date one cycle before that at which progression is first observed) can be combined to draw the conclusion. The lower limit of the TTE can be employed to validate the inference based on the commonly used upper limit of the TTE when the observed results suggest rejecting the null hypothesis.

We also compared the early stopping probability and the expected sample size of our design with the Zee et al. multinomial design based on tumor response and the binary progression event, as well as with Simon’s optimal design based on a single tumor response endpoint. The simulation results showed that the probability of stopping a study early is consistently higher, yielding a smaller expected sample size than Simon’s design, thereby indicating that integrating the tumor response endpoint and the TTE endpoint yields more efficient stopping rules than a design that has only a single tumor-response endpoint. When the expected RR is P0 = 0.1 versus P1 = 0.3, the performance of the proposed design is approximately equal to that of the Zee multinomial design in early stopping probability and expected sample size. In the other scenarios, where P0 = 0.05, P1 = 0.2 and P0 = 0.3, P1 = 0.5, our design has a higher early stopping probability and a smaller expected sample size. Therefore, incorporating TTE into the design yields superior performance in expected sample size compared with the Zee design using binary endpoints. Furthermore, compared with two-stage single-arm designs based on a single TTE endpoint [11, 12], the simulation results also indicated the superiority of the proposed design in expected sample size and early stopping probability under the null hypothesis.

Conclusions

The proposed single-arm phase II design extends the Zee multinomial design to fully use the information from different types of endpoint, where the TTE endpoint could be progression-free survival. The advantage of this design is its applicability to either cytotoxic or noncytotoxic treatment studies when the median TTE can be measured in the trials. In our simulations, the proposed design required a smaller expected sample size than the compared methods while maintaining the desired statistical properties. Therefore, when a single-arm design is adopted in a phase II trial setting, which may be due to limited patient availability or to studies investigating a therapy with only a single agent [33], it would be a superior choice for drug screening in phase II clinical trials.