Surrogate threshold effect based on a meta-analysis for the predictive value of progression-free survival for overall survival in hormone receptor-positive, HER2-negative metastatic breast cancer

Purpose: Clinical trials investigating therapies for metastatic breast cancer (mBC) generally use progression-free survival (PFS) as the primary endpoint, which is not accepted as a patient-relevant outcome within the German benefit assessment. Hence, a validation of PFS as a surrogate endpoint for overall survival (OS) is needed, e.g., in the indication of hormone receptor-positive (HR+), HER2-negative mBC.
Methods: A systematic search was conducted. Randomized controlled trials (RCTs) were included if at least one study arm investigated fulvestrant, letrozole, tamoxifen, exemestane, or anastrozole. Additionally, hazard ratios for OS and PFS had to be reported together with a confidence interval or standard error. The Pearson correlation coefficient was calculated to estimate the relation between the surrogate endpoint PFS and the patient-relevant outcome OS. In addition, the surrogate threshold effect (STE) was determined, which provides a threshold for the effect estimate of the surrogate endpoint.
Results: Sixteen studies with a total of 5324 patients were included in the analyses. The correlation between hazard ratios of PFS and OS was statistically significant (r = 0.72, 95% CI 0.35–0.90), representing a positive linear relationship. STE analysis was applied. The meta-regression model revealed an STE for $HR_{PFS}$ of 0.60, and sensitivity analyses underlined the robustness of the results.
Conclusions: Based on the derived STE, it is possible to conclude that there is a significant effect on OS for a hypothetical trial demonstrating an upper confidence limit of $HR_{PFS}$ < 0.60. However, only final OS results can confirm whether a clinically relevant difference in survival time can be achieved.
Electronic supplementary material: The online version of this article (10.1007/s10549-019-05262-4) contains supplementary material, which is available to authorized users.


Introduction
Endocrine therapies are the mainstay of treatment in hormone receptor-positive (HR+), human epidermal growth factor receptor 2 (HER2)-negative metastatic breast cancer (mBC) except in life-threatening situations qualifying the patient to receive chemotherapy [1].
Clinical trials investigating therapies for mBC often use progression-free survival (PFS) as the primary endpoint [2], since patients with mBC have a relatively long median survival time of around 3 years. With the desire to rapidly translate promising new agents into clinical practice, there is a need for endpoints that can be measured in a timely manner. Therefore, it is currently discussed whether endpoints based on disease progression, including PFS, time-to-progression (TTP), or time-to-treatment failure (TTF), are appropriate to demonstrate clinical benefit. These endpoints ensure early availability of study outcomes and can serve as sensitive parameters for the benefit of a study medication, as they are not influenced by subsequent lines of therapy or cross-over [2,3]. Further advantages are the widespread use and comparability of PFS and TTP, since they are the most frequently used primary endpoints in phase III trials and are accepted worldwide for the approval of new drugs [4][5][6].
However, the prolongation of overall survival (OS) is one of the most important therapeutic goals [7]. OS is regarded as an unambiguous criterion, but there are certain disadvantages of OS as a primary endpoint in the metastatic setting of breast cancer: the need for large numbers of patients, the long duration of follow-up until results become available, and the need for multiple subsequent therapies, which can confound OS. These limitations cause difficulties particularly in first-line studies [8,9].
Health technology assessment (HTA) agencies worldwide generally accept PFS as an endpoint in clinical trials [10], whereas the German Institute for Quality and Efficiency in Health Care (IQWiG) and the Federal Joint Committee (Gemeinsamer Bundesausschuss, G-BA) do not accept endpoints based on disease progression as patient-relevant outcomes within the benefit assessment of pharmaceuticals because they are measured by imaging techniques. Patient relevance of such endpoints might be accepted when they are measured via symptoms experienced by the patient. This would, however, lead to an omission of the re-evaluation of metastases in the course of clinical trials, which is considered unethical by physicians and does not comply with guideline recommendations [11]. Possible solutions for these different requirements have to be developed.
IQWiG suggested methods for the validation of surrogate endpoints in the HTA context [12]. The aim of this study was to apply these methods in the indication of HR+, HER2-negative mBC in order to validate PFS as a surrogate endpoint for OS.

Literature search
A systematic search was conducted in the databases MEDLINE and EMBASE as well as in five EBM Reviews sources in September 2016 and was performed in accordance with PRISMA guidelines (Appendix A.1). The following keywords and associated subject headings were used: "breast cancer" and "metastatic" or "locally advanced" in combination with "fulvestrant" or "letrozole" or "tamoxifen" or "exemestane" or "anastrozole" (Online Appendices A.2-A.4). Inclusion criteria for trials are listed in Table 1. Randomized controlled trials (RCT) were included if at least 80% of the study population met the inclusion criteria. In case of missing information regarding HER2 status or HR status, the proportion of patients meeting the inclusion criteria was extrapolated based on epidemiological data. In case HER2 status was unknown, a proportion of 81.9% of HR+ patients was assumed to be HER2-negative [13]; for patients with both unknown HER2 status and unknown hormone receptor status, a proportion of 64.5% was assumed to be HR+ and HER2-negative [13]. Trials with TTP or comparable endpoints were considered if the definition was identical to that of PFS (time from randomization to objective disease progression or death from any cause). Only studies reporting PFS according to Response Evaluation Criteria In Solid Tumors (RECIST) [14] were included to ensure standardized and comparable endpoint evaluation. Overall survival had to be reported in the studies and had to be defined as the time from the date of randomization to the date of death from any cause.
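The extrapolation described above can be illustrated with a short calculation. This is a minimal sketch, not the authors' procedure: the two proportions and the 80% threshold are taken from the text, while the function names, the way the three patient groups are combined, and the example trial are assumptions made for illustration.

```python
# Proportions quoted in the text (registry data, ref. [13]).
P_HER2_NEG_GIVEN_HR_POS = 0.819  # HR+ known, HER2 status unknown
P_HR_POS_AND_HER2_NEG = 0.645    # HR and HER2 status both unknown
INCLUSION_THRESHOLD = 0.80       # "at least 80% of the study population"


def estimated_eligible_fraction(n_confirmed, n_her2_unknown, n_both_unknown, n_total):
    """Estimate the share of a trial population that is HR+ / HER2-negative.

    n_confirmed     patients with documented HR+ and HER2-negative status
    n_her2_unknown  HR+ patients whose HER2 status is not reported
    n_both_unknown  patients with neither HR nor HER2 status reported
    n_total         all randomized patients
    """
    if n_total <= 0:
        raise ValueError("n_total must be positive")
    if n_confirmed + n_her2_unknown + n_both_unknown > n_total:
        raise ValueError("subgroup counts exceed n_total")
    # Assumption: expected eligible patients are summed across the groups.
    expected = (n_confirmed
                + P_HER2_NEG_GIVEN_HR_POS * n_her2_unknown
                + P_HR_POS_AND_HER2_NEG * n_both_unknown)
    return expected / n_total


def meets_inclusion_threshold(fraction):
    """True if the estimated eligible fraction reaches the 80% threshold."""
    return fraction >= INCLUSION_THRESHOLD


# Hypothetical trial: 200 patients, all HR+, HER2 status never reported.
frac = estimated_eligible_fraction(0, 200, 0, 200)
print(round(frac, 3), meets_inclusion_threshold(frac))
```

Under these assumptions, a trial in which all patients are HR+ with unreported HER2 status would just pass the threshold (81.9%), whereas a trial reporting neither status would not (64.5%).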
Two reviewers independently assessed citations to determine relevance to the research question. Included studies were cross-checked for relevance by physicians. If several publications for one study were available, data from the latest publication or publications reporting final data cuts were used. Data from included studies were extracted by one reviewer; another reviewer checked for consistency against the original source. Risk of bias on study level was assessed and summarized for all included individual studies (Online Appendix A.5).

Statistical methods
As part of a rapid report, the German IQWiG presented methods for the validation of surrogate endpoints and recommendations for correlation-based procedures [12]. Health technology assessments in Germany are based on these methods. The methods include the evaluation of the certainty of conclusions of study results and of the trial-level correlation between effect estimates for the surrogate endpoint (e.g., PFS) and the true outcome (e.g., OS), where correlation is estimated by the sample Pearson correlation coefficient r. Requirements for a successful surrogate validation are a high correlation (lower confidence limit (LCL) of r > 0.85) and a high certainty of conclusions of the results of the included studies. If the correlation is low (upper confidence limit of r < 0.7), no statement on surrogate validation is possible. In all other cases, where correlation is in the medium range and the validity of the surrogate endpoint is therefore unclear according to IQWiG methodology, IQWiG proposes applying the concept of the surrogate threshold effect (STE) [15], which allows conclusions on true endpoints by means of surrogate endpoints. The STE is defined as the minimal treatment effect on the surrogate endpoint needed to predict a non-zero (i.e., significant) effect on the true endpoint. In this context, the STE represents the maximum value of the hazard ratio for PFS ($HR_{PFS}$) that needs to be observed in a trial to allow the conclusion of a significant effect on OS.
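The three correlation categories described above can be written as a small decision rule. The sketch below is illustrative only: the cut-offs (0.85 and 0.7) come from the text, the function name and category labels are assumptions, and the second IQWiG requirement (high certainty of conclusions of the included studies) is not modeled.

```python
def classify_correlation(r_lcl, r_ucl):
    """Classify a trial-level correlation from the confidence limits of r.

    r_lcl, r_ucl: lower and upper confidence limits of Pearson's r.
    Returns "high", "low", or "medium" (the case in which an STE
    analysis is proposed).
    """
    if not -1.0 <= r_lcl <= r_ucl <= 1.0:
        raise ValueError("expected -1 <= r_lcl <= r_ucl <= 1")
    if r_lcl > 0.85:
        return "high"    # correlation requirement for validation is met
    if r_ucl < 0.7:
        return "low"     # no statement on surrogate validation possible
    return "medium"      # validity unclear: apply the STE concept


# Confidence limits reported in this study (95% CI 0.35-0.90).
print(classify_correlation(0.35, 0.90))  # medium
```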
First, we tested the correlation between both outcomes ($H_0\colon r = 0$ vs. $H_1\colon r \neq 0$). Second, if the correlation was medium, we fitted a mixed-effects meta-regression model to the data with $HR_{PFS}$ as moderator and the hazard ratio of OS ($HR_{OS}$) as outcome variable, weighted by the standard error (SE) of OS, using the restricted maximum likelihood (REML) estimator for the amount of heterogeneity. Since the SE is usually not reported, we recalculated it from the 95% confidence interval (CI) of the hazard ratio as $SE = (\log(HR) - \log(HR_{LCL})) / z_{0.975}$, where $HR_{LCL}$ is the lower confidence limit of the hazard ratio and $z_{0.975}$ is the 97.5th percentile of the standard normal distribution. Based on the regression fit, we calculated a prediction band for $HR_{OS}$ at a significance level of α = 0.05. The meta-regression model and predicted values were computed in R [16] using the functions rma.uni and predict.rma from the metafor package [17]. The STE was obtained as the intersection of the upper prediction limit curve with the horizontal line $HR_{OS} = 1$ (zero effect).
In sensitivity analyses, we investigated whether the factors HER2 status (reported vs. not reported), line of treatment (only first-line vs. others), and therapy option (studies comparing combination therapy with monotherapy vs. studies comparing two monotherapies) accounted for heterogeneity.
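The recalculation of the standard error from a reported confidence interval can be sketched as follows. The analysis itself was run in R; Python is used here only for illustration, and the function name and the example hazard ratio are hypothetical.

```python
import math
from statistics import NormalDist


def se_log_hr_from_ci(hr, hr_lcl, level=0.95):
    """Standard error of log(HR) recovered from the lower confidence limit.

    Implements SE = (log(HR) - log(HR_LCL)) / z, where z is the
    (1 - (1 - level) / 2) quantile of the standard normal distribution
    (z_0.975 for a 95% CI).
    """
    if not 0 < hr_lcl <= hr:
        raise ValueError("expected 0 < hr_lcl <= hr")
    z = NormalDist().inv_cdf(1 - (1 - level) / 2)
    return (math.log(hr) - math.log(hr_lcl)) / z


# Hypothetical trial result: HR = 0.80 with a lower 95% confidence limit of 0.64.
print(round(se_log_hr_from_ci(0.80, 0.64), 3))  # 0.114
```

The formula assumes that the reported interval is symmetric on the log scale, which holds for intervals derived from a Cox model or log-rank statistic.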

Systematic literature search
The search identified 9071 citations from MEDLINE®, EMBASE, and EBM Reviews databases. We included 16 studies (26 full texts) for analysis (Fig. 1).
Characteristics of the included trials are summarized in Table 2. The 16 trials included 5324 patients in total. In ten trials, HER2 status was reported for the entire study population. Six trials were included in the analysis because at least 80% of the study population was estimated to meet the inclusion criteria based on calculations using epidemiological data (see methods). Six trials (2875 patients) evaluated treatments exclusively in the first-line setting for locally advanced or metastatic disease, and ten trials (2449 patients) included pretreated patients or patients in various lines of treatment. Almost all trials included only postmenopausal women, except for two trials that included a small (2.9%) [18] or unknown [19] number of premenopausal women treated with GnRH agonists.
Twelve trials compared combination therapy with monotherapy, while four trials compared monotherapy versus monotherapy. Combination treatments were add-on to hormone therapy and comprised different compound classes in comparison to endocrine therapy.
Endpoints were reported for the intention-to-treat population (seven trials), the full analysis set (three trials), a modified intention-to-treat population (two trials), or all randomized patients (three trials). For one trial, no information was given on the analysis population.

Statistical analysis
In the main analysis (pool of 16 identified trials), the correlation between hazard ratios of PFS and OS was statistically significant (r = 0.72, 95% CI 0.35-0.90, p = 0.0016), representing a positive linear relationship between the surrogate endpoint and the patient-relevant endpoint. According to the definition in IQWiG's rapid report, the correlation was only of medium size; the validity of the surrogate endpoint is therefore unclear, and an STE analysis was applied. The meta-regression showed low residual heterogeneity ($\tau^2$ = 0.009, $I^2$ = 25%) and a significant moderator coefficient $\beta_{PFS}$ (p = 0.0206). The STE for $HR_{PFS}$ was 0.60 (Fig. 2). Thus, for trials that meet the above-mentioned inclusion criteria in this specific indication and show an upper confidence limit of $HR_{PFS}$ below the STE, it is possible to conclude that there is a significant effect on OS by means of the surrogate endpoint PFS.
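As a plausibility check, the reported confidence interval for r can be approximated with the Fisher z-transformation. The text does not state how the interval was computed, so the choice of method is an assumption; the function name is hypothetical. With the rounded inputs r = 0.72 and n = 16 trials, the sketch reproduces the reported interval to two decimals.

```python
import math
from statistics import NormalDist


def pearson_ci_fisher(r, n, level=0.95):
    """Approximate confidence interval for Pearson's r via Fisher's z.

    z = atanh(r) is treated as normally distributed with standard
    error 1 / sqrt(n - 3); the limits are transformed back with tanh.
    """
    if n <= 3:
        raise ValueError("n must be greater than 3")
    if not -1.0 < r < 1.0:
        raise ValueError("r must lie strictly between -1 and 1")
    z = math.atanh(r)
    se = 1.0 / math.sqrt(n - 3)
    q = NormalDist().inv_cdf(1 - (1 - level) / 2)
    return math.tanh(z - q * se), math.tanh(z + q * se)


lcl, ucl = pearson_ci_fisher(0.72, 16)
print(round(lcl, 2), round(ucl, 2))  # 0.35 0.9
```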
Sensitivity analyses were performed to check the robustness of the main analysis with respect to available information on HER2 status (sensitivity analysis 1), line of treatment (sensitivity analysis 2), and therapy option (sensitivity analysis 3) (Table 3). Due to the smaller sample sizes in the subpools, STE values deviate from the value in the main analysis, but the correlation in all subpools is positive and of at least medium magnitude, confirming the positive relationship between OS and PFS. In all subpools, the STE is below 1 except for sensitivity analysis 2b (Table 3). In this case, the STE cannot be calculated (the upper limit of the prediction band for $HR_{OS}$ is > 1 for any value of $HR_{PFS}$). Hence, meta-regression analyses in all specified subpools did not show heterogeneity regarding the observed factors and confirm the results of the main analysis.

Discussion
PFS is an accepted endpoint with a definition based on standardized criteria according to RECIST [14]. The outcome of PFS is not influenced by subsequent therapies, results are available in a timely manner, and fewer patients are needed than for OS. In addition, results are widely accepted for the approval [4,5] as well as the HTA evaluation of new drugs [10], except by German HTA bodies, which consider patient relevance unproven because PFS is evaluated by imaging and not by symptoms. From a physician's point of view, PFS is highly relevant for patients. In case of progression, the patient's therapy needs to be changed, which entails different adverse effects and requires new procedures and adjustments of schedules. A proven progression also has a significant impact on psychological well-being and quality of life [35].
Additionally, prolonging OS and maintaining quality of life continue to be the focus of treatment in the metastatic situation of breast cancer [7]. To quickly transfer PFS results from trials with innovative therapies to clinical practice, it would be advantageous if a validation of progression-based endpoints as surrogate endpoints for OS were available; providing such a validation was the aim of this study.
The methods used in this work have some limitations. It is possible that the pool of included studies does not cover all publicly available data because the search was limited to three literature databases and included no further sources. In addition, several aspects often led to the exclusion of studies. One reason was poor reporting, for example if data for only one of the required endpoints were published. Other reasons were a lack of information on HER2 status, leading to non-conformity with the defined patient population, and PFS/TTP assessment that did not follow RECIST criteria. Older studies in particular were often not in accordance with the inclusion criteria.
The sensitivity analyses show that the STE values vary strongly when only very small study subsets are considered. Nevertheless, the values are not so far apart that they would point completely in the other direction, i.e., STE > 1. Furthermore, the STE is sensitive to outlier observations when the number of studies in the model is low. The method of randomization and the adequacy of allocation concealment were rarely reported in the individual studies. To what extent this has an impact on the endpoints OS and PFS, and ultimately on the STE, remains unclear.
According to IQWiG's method description, the entire 95% CI of $HR_{PFS}$ has to be below the STE in order to take into account the uncertainty affecting both estimators. Gillhaus et al. [36] described that this approach reduces the α error but also considerably reduces the power of the STE concept. Statistical power could be increased by using a higher α significance level (e.g., 0.1 or 0.2) for the prediction band of $HR_{OS}$ in the meta-regression model. However, conclusions based on the STE can only be drawn if the hypothetical trial is conducted in patients with HR+, HER2-negative mBC treated with endocrine therapies alone or in combination with other targeted treatments. The model is not intended to predict the hazard ratio for OS or differences in median OS.
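The IQWiG criterion described above, that the entire 95% CI of the PFS hazard ratio must lie below the STE, reduces to a comparison of the upper confidence limit with the threshold. The sketch below assumes the STE of 0.60 derived in this study as the default; the function name and the example trial values are hypothetical, and the check says nothing about the size or clinical relevance of the OS effect.

```python
def os_effect_can_be_concluded(hr_pfs_ucl, ste=0.60):
    """Check whether a trial's PFS result lies entirely below the STE.

    hr_pfs_ucl: upper 95% confidence limit of the PFS hazard ratio.
    ste:        surrogate threshold effect (0.60 in this study).
    Returns True only if the upper limit is strictly below the STE.
    """
    if hr_pfs_ucl <= 0 or ste <= 0:
        raise ValueError("hazard ratios must be positive")
    return hr_pfs_ucl < ste


# Hypothetical trials in HR+, HER2-negative mBC:
print(os_effect_can_be_concluded(0.55))  # True: whole CI below the STE
print(os_effect_can_be_concluded(0.65))  # False: CI overlaps the STE
```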
In general, OS results always need a critical appraisal. Especially in mBC, an improvement of OS for a new therapy option is difficult to measure. Factors like the heterogeneity of the disease, therapy complexity with integration of local therapies (surgery, radiotherapy), and a wide range of systemic therapies, as well as long survival in the metastatic situation with numerous different sequential courses of therapy, may have an impact on the results of OS. A model calculation has shown that the probability of demonstrating a significant OS benefit decreases to less than 30% for a post-progression survival (PPS) of more than 12 months [37]. However, survival of several years has been reached especially in mBC. In addition, depending on the required statistical power, thousands of patients need to be recruited to identify a survival benefit. In the age of individualized therapy with numerous specific subgroups, these studies are hardly feasible. The authors of the model calculation also conclude that the interpretation of OS is only useful if the PPS is short [37]. An additional point to take into account is the clinical relevance of OS results. The STE calculated in this publication only allows conclusions to be drawn on OS in the above-mentioned settings and on the statistical significance of OS. However, it is not possible to predict the difference in median survival times and its clinical relevance. Therefore, it is possible that the final result for OS is statistically significant in a trial but is not considered clinically relevant. For example, a difference of 3 months in median OS is clinically relevant in an indication with very short survival times like metastatic pancreatic carcinoma [38]. mBC has comparably long survival times of 2-3 years [39], and a difference of 3 months in median OS would normally not be considered clinically relevant.
Even if a clinically relevant difference in median OS were achieved, a proven prolongation of life with a simultaneous significant deterioration in quality of life is not always a desirable therapeutic goal [40].
In conclusion, despite minor methodological limitations, we were able to calculate an STE (0.60) that allows conclusions on OS to be drawn through the surrogate endpoint PFS in trials with HR+, HER2-negative mBC treated with endocrine therapies alone or in combination.
This means that for a hypothetical or future trial demonstrating an upper confidence limit of $HR_{PFS}$ < 0.60, it is possible to conclude that there is a significant effect on OS. However, only final OS results can confirm whether a clinically relevant difference in survival time is reached. Looking ahead, it will be desirable to consider the current results in relation to ongoing clinical studies examining the addition of CDK 4/6 inhibitors to endocrine therapy, since these studies mostly lack statistically significant, mature OS data for the time being. As long as OS results are not available, conclusions using the STE may be drawn from PFS. To gain quick results on a new drug, PFS remains a relevant endpoint with high clinical relevance.