Background

The goal of phase II clinical trials in oncology is to identify new drugs which are sufficiently promising in terms of efficacy to warrant further investigation [1]. By separating effective from ineffective treatments in phase II, appropriate phase III trials may be conducted. Efficient phase II trial designs are critical, as a large pool of drugs must be tested in a limited pool of patients at high cost [2, 3]. However, just as there is typically uncertainty over the mechanistic specificity of new agents [4], phase II evaluation is complicated by uncertainty over what clinical outcomes might be observed and be indicative of treatment efficacy. This renders the choice of clinical trial endpoints challenging, as some agents may induce tumour shrinkage, some may prevent worsening of disease, and others may do both, with variation by disease [5-8]. In addition, the majority of agents investigated in clinical trials are ineffective, and the ability to stop phase II trials early is desirable in such cases [3].

The most frequently employed phase II oncology endpoint is the response rate (RR) [9], which is most often defined using the RECIST criteria [10]. According to the RECIST criteria, a tumour response occurs if there is a 30% decrease in the sum of the longest diameters of measured tumour nodules. Tumour response has as its opposite progressive disease (PD), which is defined as a 20% increase in the same sum of diameters. Cases in which progressive disease occurs at the time of the first tumour measurement after treatment initiation can be termed early progressive disease (EPD). Tumours not shrinking or growing enough to reach definitions of response or progression are termed as having stable disease (SD).
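The category definitions above can be illustrated with a minimal sketch. It uses only the two thresholds quoted in the text (a 30% decrease and a 20% increase in the sum of longest diameters) and compares a follow-up sum against a single reference sum. The function name `classify_tumour_change` and its parameters are hypothetical, and the full RECIST criteria contain further rules (for example, the appearance of new lesions) that are not modelled here.

```python
def classify_tumour_change(reference_sum_mm: float, followup_sum_mm: float) -> str:
    """Classify a change in the sum of longest diameters.

    Simplified illustration: only the 30% decrease and 20% increase
    thresholds quoted in the text are applied.
    """
    if reference_sum_mm <= 0:
        raise ValueError("reference sum of diameters must be positive")
    # Relative change: negative values mean shrinkage, positive mean growth.
    change = (followup_sum_mm - reference_sum_mm) / reference_sum_mm
    if change <= -0.30:
        return "response"
    if change >= 0.20:
        return "progressive disease"
    return "stable disease"


for followup in (65.0, 110.0, 125.0):
    print(followup, classify_tumour_change(100.0, followup))
```

In this sketch, progressive disease found at the first post-treatment measurement would correspond to EPD.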

Higher response rates are associated with improvements in survival [11-14] and are predictive of eventual regulatory approval [15], but this endpoint may not be appropriate for all drugs or diseases. Specifically, there are situations in which disease stabilization may occur but actual responses may be rare, such that using RR as the sole benchmark could lead to the dismissal of potentially useful drugs. For example, despite a response rate of 2%, sorafenib is now standard treatment for incurable hepatocellular carcinoma [8]. A phase III study occurred only because the failed phase II primary endpoint of response was ignored in favour of other signals of efficacy, including the duration of disease stabilization and survival [16].

Stable disease has also been associated with survival improvements [17], but is typically not used alone, rather frequently being combined with RR in an endpoint termed clinical benefit or disease control rate [18, 19]. Alternatively, because a prolonged stable disease period would appear to offer patient benefit, other endpoints are used such as time to progression (TTP) [20], defined as the time interval until a cancer meets the definition of progressive disease. Progression-free survival (PFS) expands TTP such that the endpoint is marked at the time of either tumour progression or patient death.

Due to the large numbers of ineffective, and frequently toxic, agents studied in phase II, ethics dictate that many phase II clinical trials employ a two-stage method, which may be designed with the goal of minimizing trial size when agents are truly ineffective [21, 22]. Using TTP or PFS rather than RR in a two-stage design generally requires a longer time to assess outcomes, potentially requiring additional patients to receive an ineffective treatment [23-25]. Furthermore, trials based on TTP or PFS alone may conclude an agent is inactive after stage I, even if it induces increased responses. In an untargeted population, it is possible that a subpopulation of patients, of unknown size and with an unknown molecular marker, will have a tumour which is targeted by the treatment. Whether such tumours will shrink (i.e. respond) or simply stop growing (i.e. demonstrate an increased TTP) would be unknown. But there may be considerable interest in an agent which demonstrates either an increase in TTP, as occurred in the development of sorafenib, or RR, as was observed with crizotinib [26]. The ability to combine the RR and TTP endpoints would improve phase II trial sensitivity to drug activity when the nature of that activity is uncertain.

While much research has focused on RR and disease stabilization, the use of EPD has also been studied as part of a multinomial endpoint [27, 28]. It should be noted that EPD is directly related to TTP, being by definition the earliest measurable manifestation of progression. If one assumes a common distribution for TTP in a population, a sufficiently high rate of EPD will predict a shorter (and perhaps undesirable) TTP. Yet, while EPD may provide an early signal of drug inactivity, it is not intuitive to clinicians, who are more accustomed to considering TTP comparisons using median values.

The present work combines endpoints with the aim of improving phase II trial sensitivity and specificity while addressing the need for intuitive measures. Specifically, the investigator specifies desirable and undesirable values for RR and TTP only, so that both potential manifestations of drug activity can be observed as signals of activity. The model then generates stopping rules for a two-stage trial with RR and TTP. In addition, employing an exponential distribution for progression in order to relate specified TTP parameters to their corresponding EPD values, the model generates stage I stopping rules using easily calculable RR and EPD rates. Using EPD at stage I in lieu of TTP avoids the delay required to observe TTP for the entire stage I cohort and allows earlier stoppage of the trial should EPD be too high (and therefore the corresponding TTP too low). This paper summarizes an assessment of this model using different parameters of interest to outline the possibilities and limitations of such a combined endpoint, hereafter termed the Combination Stopping Rule (CSR).

Methods

Stopping rules for a single-arm, two-stage trial were constructed using simulations performed in TreeAge Pro Healthcare software (Version 1.0.2, 2009, Williamstown, Massachusetts) (program available on request). For this analysis, the desired statistical power and alpha error were restricted to ≥ 80% and ≤ 0.05, respectively, for the overall study throughout; however, other error limits could be used in the future as needed. For each simulation, the user specifies the RR of interest, RR of disinterest, median TTP of interest, median TTP of disinterest, and the stage I and II sample sizes (n1, n2). The user may also alter the time of first tumour measurement and an absolute minimum median time for tumour progression allowable for a drug. Stopping rules are based on RR and median TTP at the second stage of accrual, but early stopping could occur at the end of the first stage of accrual when there are poor RR and EPD rates. Based on median TTP values of interest and disinterest, the model uses an exponential distribution to calculate EPD and assigns response as a dichotomous variable based on the specified probability.
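The conversion from a median TTP to an EPD rate can be sketched as follows, assuming an exponential time to progression as described above. Under that assumption the hazard is ln(2) divided by the median, and the EPD rate is the probability of progression on or before the first tumour measurement. The function name `epd_rate` and the 1.5-month measurement time are illustrative assumptions, not values taken from the model.

```python
import math


def epd_rate(median_ttp_months: float, first_measurement_months: float) -> float:
    """Probability of progression by the first measurement under an
    exponential time-to-progression distribution."""
    if median_ttp_months <= 0 or first_measurement_months < 0:
        raise ValueError("median must be positive and time non-negative")
    hazard = math.log(2) / median_ttp_months  # exponential rate per month
    return 1.0 - math.exp(-hazard * first_measurement_months)


# Hypothetical first measurement at 1.5 months (an assumption for illustration).
for median in (3.0, 6.0):
    print(median, round(epd_rate(median, 1.5), 3))
```

With these illustrative inputs, a 3-month median corresponds to an EPD rate of about 0.29 and a 6-month median to about 0.16, so a longer median TTP maps to a lower EPD rate, as the text describes.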

The null hypothesis (H nul) specifies the response rate (r nul) and median TTP (ttp nul) that render a drug uninteresting for further development, such that: H nul: r ≤ r nul and ttp ≤ ttp nul, where r is the actual response rate and ttp is the actual median TTP. Similarly, the alternate hypothesis (H alt) specifies the response rate (r alt) and median TTP (ttp alt) that would render a drug interesting for further development, such that: H alt: r ≥ r alt or ttp ≥ ttp alt. At stage I, interpolating on the progression curve and using the time of first measurement to determine the resulting null EPD rate (epd nul), the null hypothesis is expressed as H nul: r ≤ r nul and epd ≥ epd nul, where epd is the rate of early progression, while the alternate hypothesis is expressed as H alt: r ≥ r alt or epd ≤ epd alt. Note that H nul, indicative of drug inactivity, is only accepted if both RR is low and median TTP is low (or at stage I, the surrogate of TTP, EPD, is high). At stage II, if either RR is high or median TTP is high, then H nul is rejected in favor of H alt and the drug is considered active. Early stopping at stage I for rejection of H nul is not permitted.

Functionally, using the investigator inputs, the simulation first establishes the stage II stopping rules (RR, TTP) required to achieve the desired power. The null hypothesis is rejected if r 1 + r 2 ≥ r 2a or ttp ≥ ttp 2a, where r 1 + r 2 is the cumulative number of patients with responses at the end of stage II, ttp is the median TTP at the end of stage II, and r 2a and ttp 2a are the response and median TTP thresholds determined by the software. The stopping rules do not consider any association between the TTP value and response for an individual in the trial. The software then establishes stopping rules at stage I incorporating RR and EPD which optimize power at the expense of increased alpha error where necessary. At the end of stage I, therefore, the null hypothesis is accepted if r 1 ≤ r 1nul and epd ≥ epd 1nul, where r 1 and epd are the numbers of patients with response and EPD at the end of stage I, and r 1nul and epd 1nul are the thresholds ascertained by the program.
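The two decision rules can be written down compactly. The sketch below is illustrative only: the class `CsrThresholds` and both function names are hypothetical, and the example thresholds are those quoted in the Results for the first row of Table 1 (stop at stage I with zero responders and 5 or more early progressions; reject H nul at stage II with 5 or more responders or a median TTP of 5.25 months or higher).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CsrThresholds:
    r1_nul: int    # stage I: stop if responders <= r1_nul ...
    epd1_nul: int  # ... and patients with EPD >= epd1_nul
    r2a: int       # stage II: reject H nul if cumulative responders >= r2a ...
    ttp2a: float   # ... or median TTP in months >= ttp2a


def stops_at_stage_one(r1: int, epd: int, t: CsrThresholds) -> bool:
    """H nul is accepted early only if BOTH stage I signals are poor."""
    return r1 <= t.r1_nul and epd >= t.epd1_nul


def rejects_null_at_stage_two(responders: int, median_ttp: float,
                              t: CsrThresholds) -> bool:
    """H nul is rejected if EITHER endpoint reaches its threshold."""
    return responders >= t.r2a or median_ttp >= t.ttp2a


# Thresholds quoted in the Results for the first row of Table 1.
rule = CsrThresholds(r1_nul=0, epd1_nul=5, r2a=5, ttp2a=5.25)
print(stops_at_stage_one(r1=0, epd=6, t=rule))
print(rejects_null_at_stage_two(responders=2, median_ttp=5.5, t=rule))
```

The asymmetry is deliberate in this sketch: the stage I rule joins its conditions with "and", whereas the stage II rule joins them with "or", mirroring the hypotheses defined above.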

Thresholds are identified by the program using 100,000 simulated trials. RR is evaluated using sequential increments of one patient, while for TTP increments are 0.25 months. For a threshold to be valid, it must satisfy the α error when RR = r nul and median TTP = ttp nul, and it must satisfy the β error when either RR = r alt or median TTP = ttp alt. For calculating the β error, half the simulated trials are performed with RR = r alt and median TTP randomly assigned to a value less than ttp alt, while the other half are performed with median TTP set to ttp alt and RR randomly assigned a value less than r alt. RR and EPD thresholds are then generated for the stage I test, while ensuring error rates are maintained for the entire study. Additionally, simulations are restricted such that RR + EPD ≤ 1 at stage I and by the imposed absolute minimum median time to progression.
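A simplified Monte Carlo check of one candidate set of thresholds might look like the sketch below. It is not the TreeAge model, and it makes several assumptions that are not taken from the source: a first measurement at 1.5 months, no censoring, the sample median in place of a Kaplan-Meier estimate, no RR + EPD ≤ 1 restriction, and far fewer simulated trials. Response and TTP are drawn independently, as described above. All function and parameter names are hypothetical.

```python
import math
import random
import statistics


def trial_rejects_null(rng, rr, median_ttp, n1, n2, first_measure,
                       r1_nul, epd1_nul, r2a, ttp2a):
    """Simulate one two-stage trial; return True if H nul is rejected."""
    hazard = math.log(2) / median_ttp
    n = n1 + n2
    responses = [rng.random() < rr for _ in range(n)]
    ttps = [rng.expovariate(hazard) for _ in range(n)]
    # Stage I: count responders and early progressions in the first n1 patients.
    r1 = sum(responses[:n1])
    epd = sum(t <= first_measure for t in ttps[:n1])
    if r1 <= r1_nul and epd >= epd1_nul:
        return False  # stopped early, H nul accepted
    # Stage II: either endpoint reaching its threshold rejects H nul.
    return sum(responses) >= r2a or statistics.median(ttps) >= ttp2a


def rejection_rate(rr, median_ttp, n_trials=5000, seed=1, **design):
    """Fraction of simulated trials that declare the drug active."""
    rng = random.Random(seed)
    hits = sum(trial_rejects_null(rng, rr, median_ttp, **design)
               for _ in range(n_trials))
    return hits / n_trials


# Thresholds quoted in the Results for the first row of Table 1;
# first_measure=1.5 months is an assumption made for this sketch.
design = dict(n1=15, n2=15, first_measure=1.5,
              r1_nul=0, epd1_nul=5, r2a=5, ttp2a=5.25)
alpha = rejection_rate(rr=0.05, median_ttp=3.0, **design)
power_rr = rejection_rate(rr=0.20, median_ttp=3.0, **design)
print(f"alpha estimate: {alpha:.3f}, power with RR = r alt: {power_rr:.3f}")
```

Because of these simplifications the estimates should not be expected to reproduce the published error rates. The sketch only shows how a threshold set can be scored against the α condition (both endpoints at their null values) and one of the β conditions (RR at r alt).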

The rate of patient censoring for median TTP estimation may also be altered by the user. For our modeling, it was assumed that patients who come off study due to toxicity or death (but not disease progression) prior to the time of first tumour measurement are replaced, although this may not be generalizable to all real-world phase II studies. Patients censored for TTP after the first tumour measurement were not replaced, and estimation of median TTP used the Kaplan-Meier method.
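For readers unfamiliar with the estimator, a minimal Kaplan-Meier median can be sketched in a few lines. This is a generic illustration rather than the software's implementation. The function name `km_median` is hypothetical, and the convention used (the earliest event time at which estimated progression-free probability falls to 0.5 or below) is one of several in use.

```python
def km_median(times, progressed):
    """Kaplan-Meier median time to progression.

    times: follow-up time for each patient.
    progressed: True if progression was observed, False if censored.
    Returns None if the estimated curve never reaches 0.5.
    """
    data = sorted(zip(times, progressed))
    at_risk = len(data)
    surv = 1.0
    i = 0
    while i < len(data):
        t = data[i][0]
        events = 0
        leaving = 0
        # Group all patients sharing the same follow-up time.
        while i < len(data) and data[i][0] == t:
            events += 1 if data[i][1] else 0
            leaving += 1
            i += 1
        if events:
            surv *= 1.0 - events / at_risk
            if surv <= 0.5:
                return t
        at_risk -= leaving  # events and censored patients leave the risk set
    return None


# One patient censored at 4 months; the curve reaches 0.375 at 6 months.
print(km_median([2, 4, 6, 8], [True, False, True, True]))  # 6
```

Censored patients contribute to the risk set until their censoring time but never trigger a drop in the curve, which is why a higher censoring rate can shift the estimated median.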

Results

Thresholds generated by the software using a fixed sample size (n1 = 15, n2 = 15) while varying H nul and H alt are shown in Table 1. Parameters for H nul and H alt were based on the response values used in prior work [22, 28] with the addition of plausible median TTP values. To interpret this table, the first row, where r nul = 0.05, r alt = 0.2, ttp nul = 3 and ttp alt = 6, would be read as follows: if there were zero responders and 5 or more patients with early progressive disease at the end of stage I, the study would be stopped and H nul accepted. Otherwise, the second stage sample would be recruited, after which H nul would be rejected if there were 5 or more responders or a median TTP of 5.25 months or higher. The resulting power would be 0.815 and the alpha error 0.035. For true uninteresting drugs, the probability of stopping the study (accepting H nul) at stage one would be 0.21, and the expected number of patients recruited would be 26.8.
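The expected number of patients follows from the stage sizes and the probability of early stopping: every trial recruits n1 patients, and only trials that continue recruit the further n2. The sketch below applies that relationship to the rounded values quoted above. The function name is hypothetical.

```python
def expected_sample_size(n1: int, n2: int, prob_early_stop: float) -> float:
    """Expected patients recruited in a two-stage design that can only
    stop after stage I."""
    return n1 + (1.0 - prob_early_stop) * n2


print(round(expected_sample_size(15, 15, 0.21), 2))  # 26.85
```

Using the rounded stopping probability of 0.21 gives 26.85, which agrees with the reported 26.8 up to rounding of the inputs.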

Table 1 TTP and RR Thresholds Generated with fixed N while varying H nul and H alt (1-beta = 0.8, alpha = 0.05, censoring 0.05, n1 = 15, n2 = 15)

For small studies (n 1 = 15, n 2 = 15), differentiating two endpoints is difficult, resulting in a low probability of early stopping after stage I in some circumstances. In the most extreme case evaluated, a design with r alt = 0.2, r nul = 0.05, ttp alt = 7, and ttp nul = 4 results in stage I stopping thresholds of r 1 ≤ -1 and epd ≥ 16. Neither condition can be met with 15 stage I patients, so the study can never accept H nul at stage I and all trials will recruit 30 subjects. In other designs, the α error could not be maintained. Only trials with large differences between r alt and r nul as well as between ttp alt and ttp nul were able to satisfy both error limits.

The effect of increasing the study size is seen in Table 2. Improvements in alpha error rates are observed and higher rates of early stopping are found. A minimally lower ttp 2a is also sometimes noted, a result of the interplay between the thresholds chosen for RR and TTP; in larger studies, the model is able to find a value for r 2a which gives a RR closer to r alt (i.e. higher), and the paired ttp 2a is thus slightly lower to maintain the specified power. For studies with the highest r alt/r nul and ttp alt/ttp nul values, studies need to be relatively large to achieve an error rate of 0.05. Higher error rates may be acceptable in some circumstances.

Table 2 TTP and RR Thresholds Generated with Larger N (1-beta = 0.8, alpha = 0.05, censoring 0.05)

If the censoring rate for TTP is increased to 0.1 from 0.05, the error rates and stage II thresholds are similar (Table 3). The stage I thresholds vary more in some cases.

Table 3 TTP and RR Thresholds Generated with Censoring set at 0.1

Compared with the Simon optimal or Fleming designs [22, 29], the probability of early stopping (PES) of these designs appears to be reduced. For example, for the Simon optimal design comparing r nul = 0.05 versus r alt = 0.20, with α ≤ 0.05 and β ≤ 0.20 and a total sample size of 29 patients, the PES after 10 patients is 0.599, while the Fleming design with 15 patients in each of 2 stages has a PES of 0.463, albeit with α = 0.058. In contrast, the PES for the CSR is only 0.21, indicative of the increased difficulty of differentiating between two hypotheses.
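For a response-only design, the PES under the null is a binomial tail probability. The sketch below reproduces the two comparison figures on the assumption that each design stops at stage I when no responses are observed. That stopping boundary is not stated in the text above, but it is consistent with the quoted values. The function name is hypothetical.

```python
from math import comb


def prob_early_stop(n1: int, max_responders: int, rr: float) -> float:
    """P(responders <= max_responders) among n1 patients with true rate rr."""
    return sum(comb(n1, k) * rr**k * (1.0 - rr) ** (n1 - k)
               for k in range(max_responders + 1))


# Assumed boundary: stop if zero responders at stage I, with r nul = 0.05.
print(round(prob_early_stop(10, 0, 0.05), 3))  # 0.599 (10 stage I patients)
print(round(prob_early_stop(15, 0, 0.05), 3))  # 0.463 (15 stage I patients)
```

The CSR stops at stage I only when a poor response count and a high EPD count occur together, so its early stopping probability cannot exceed that of the response condition alone.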

Discussion

Uncertainty over drug effect and the concern over discarding drugs that maintain disease stabilization without inducing tumour shrinkage have led investigators to look for alternatives to response rates as the sole marker of drug activity [30]. Recognizing this, we derived the Combination Stopping Rule (CSR), which uses both median TTP and RR. The CSR incorporates EPD, based on estimates of TTP, in the stage I decision-making process to provide an early signal of drug inactivity and allow for early termination of an inactive agent.

Accepting the investigator's inputs for desirable and undesirable RR and median TTP, the model can generate thresholds for patient RR and median TTP for the second stage and patient RR and EPD rates for the first stage that meet the desired error rates. Larger studies are necessary to maintain acceptable alpha error rates when evaluating higher median TTP and RR values of interest.

Stopping rules employing RR only are well established and optimal designs have been proposed in terms of minimizing the number of patients required for study [22]. In the present study, values for n 1 and n 2 are specified by the investigator, making direct comparisons difficult. However, as the design measures two endpoints concurrently, the CSR generally requires additional numbers of patients in both stages, and greater levels of activity to deem a treatment of interest for further study [22, 29]. The greater response requirement at stage II is a product of the CSR being designed to achieve the stated power when studying a population with an equal likelihood of having either 'good' response induction or 'good' time to progression.

In other work, EPD has been combined with RR [27, 28]. That combination may change the sensitivity of the phase II trial to drug activity, stopping early to accept H nul in some additional instances and finding drug activity in some instances where the sole measurement of RR would not [31].

EPD and TTP each offer specific advantages. Compared with EPD, TTP is more intuitively meaningful to investigators, and it is easier to specify TTP durations of interest and disinterest when setting trial parameters. In addition, TTP is likely a better reflection of overall patient benefit than EPD, as EPD assesses only very early progression. Although trial sizes may be larger in some instances for the CSR than for those trials employing only RR or RR and EPD, this characteristic is common to studies assessing time to progression or progression-free survival [9, 27, 32]. Conversely, a disadvantage of TTP as a solitary endpoint is the time required to observe disease progression in sufficient numbers of patients. This can be particularly problematic for multistage trials, where holding recruitment at the end of the first stage to await results can negatively impact on recruitment momentum and cost. The CSR addresses this issue by interpolating back from the specified median TTP to create a stage I set of rules employing EPD. As such, the delay to stop an ineffective treatment at stage I is minimized. The present model therefore combines the familiarity of RR and TTP with the early signals of EPD measurement.

Stopping rules combining RR with TTP may be useful in the setting of targeted drugs with unknown clinical activity or in drugs which are believed to be cytostatic [33]. There is evidence that investigators are reluctant to rely upon response alone to measure new drug activity. In several studies where observed response rates have not achieved the predetermined threshold for activity, investigators have noted signals of disease stability or survival and advocated further study [34-37]. While imperfect, there are data to support a correlation between TTP and survival, and TTP may thus be a useful addition to RR alone [20, 38].

There are limitations to the present study. The study employs TTP rather than PFS, while the latter is generally favoured because it includes survival [39]. Although rules adding PFS could be devised, they would require assumptions of a survival hazard in addition to assumptions about tumour growth and response, adding complexity to the model and uncertainty to the results. Similarly, randomization of phase II trials is recommended by some authors [40]. However, given the number of agents under investigation and the greater sample sizes required for randomized studies, non-randomized studies still predominate [9]. Furthermore, studies involving limited patient populations, such as those requiring an infrequent biomarker or rare disease, may render a randomized study impractical. Optimal single-arm methods are therefore still required.

Also, although the alpha error increases with smaller difference between ttp alt and ttp nul, practically, differences between ttp alt and ttp nul smaller than 3 months are unlikely to be interesting. It is noted too that the present study reports on only selected values for n 1 and n 2, although other values are possible. Finally, the stopping rules were generated with the assumption that a new drug under study has equal chances of having a desirable RR or a desirable TTP, although this cannot be known. Other assumptions could be made if it was felt that a drug was more likely to induce regression or stabilization, and the program could be modified.

As a model, the CSR cannot mimic disease processes with complete accuracy. The model assumes that tumour progression in the population follows an exponential distribution. It is unlikely that any one formula will adequately cover all diseases, and other curves, such as that of Gompertz, could be considered. However, exponential growth is a generally accepted distribution [41-43]. Testing the model with actual clinical trial data should provide insights into its behaviour. In addition, the model establishes actual tumour response independently from an individual subject's TTP within the study. This works for the model as responses are measured in aggregate, and responses could be assumed to be associated with the longer individual TTPs. This method was used for two reasons. First, it is unclear how a response should move a subject along the growth curve, and such a process would necessitate further assumptions. Second, the true median TTP of a simulated drug is established according to the investigator's input parameters and on whether true 'good' or true 'bad' drugs are being assessed. Allowing a response in an individual subject to influence that individual's growth curve (and thus TTP) requires that the TTPs of the remaining subjects be shifted in compensation, when such results should remain independent. Finally, the timing of tumour measurements during a trial will affect the trial's accuracy in detecting drug activity, a fact which needs to be carefully considered when using the CSR as well as other trial designs [44].

Conclusion

The CSR provides a new method of measuring drug activity in a two-stage, phase II oncology trial by combining two well understood measures, RR and TTP. By also determining thresholds for RR and EPD at the first stage of accrual to assess for early signals of drug inactivity, the method allows for earlier stage I stopping without the delay that would be required by awaiting the TTP of every patient. This method is well suited to drugs which may have uncertain or low rates of response but which may induce stabilization.