An essential step towards implementing prediction tools in daily clinical practice is validation in the target population (Steyerberg 2009). The present study is the first to test the Dutch INFLUENCE-nomogram with external data from another country. Although its predictions of the LRR-risk in the German cohort of 6520 breast cancer patients were less accurate than in the Dutch modeling-cohort, it did not perform worse in terms of discrimination-ability (C-statistic/AUC German validation-cohort: 0.73, CI 0.69–0.77 vs. C-statistic/AUC Dutch modeling-cohort: 0.71, CI 0.69–0.73).
Germany and the Netherlands are direct European neighbors that have much in common. Both the Netherlands Cancer Registry and the German Tumor Center Regensburg (as part of the Population based Cancer Registry Bavaria) are members of the European Network of Cancer Registries (https://www.encr.eu/) and follow the mandatory data-collection standards and dataset requirements developed by this network. The similarities between the two countries go beyond registration rules, however. They are also reflected in highly similar patient and tumor characteristics. Moreover, the national breast cancer treatment guidelines of the Netherlands and Germany are largely congruent, since they rest on the same evidence base, as in many countries (NABON 2012; Leitlinienprogramm Onkologie (Deutsche Krebsgesellschaft, Deutsche Krebshilfe, AWMF) 2019; Wolters et al. 2012). Nevertheless, some substantial differences in treatment modalities can be observed, pointing to different national preferences in breast cancer treatment. The breast-conserving surgery rate in the Dutch cohort is 21.5% lower than in the German cohort, which might also explain the less frequent use of adjuvant radiotherapy. There are several potential reasons for this difference. First, the Dutch cohort derives from the years 2003 to 2006, whereas half of the German patients were treated thereafter. Between 2000 and 2012, the rate of breast-conserving surgery in the Netherlands rose from 54 to 72% (Maaren et al. 2018). Second, one has to bear in mind that the Dutch patients as a whole are compared to a single region in Germany. Even in a small country like the Netherlands, large interregional variation exists in the use of breast-conserving surgery. According to a recent publication by van Maaren et al. (Maaren et al. 2018), some Dutch regions featured breast-conserving surgery rates slightly below 80% already in the last decade, while others did not reach the 60% threshold as late as 2015. Variation decreased only slightly after adjusting for different case mixes. It is very likely that a similar variation can be observed in Germany. The hospitals in the southern German region that we used for validation participate very actively in scientific research, which explains their early and broad implementation of the breast-conserving approach. However, the different national preferences concerning the surgical approach should not have influenced the results of our study, since the LRR-rate is comparable between breast-conserving surgery and mastectomy (Yang et al. 2008). Moreover, type of surgery does not contribute directly to the predictions of the INFLUENCE-nomogram: breast-conserving surgery was strongly related to radiation therapy and, therefore, only the latter variable was included in the model (Witteveen et al. 2015).
For the considerably lower rate of endocrine therapy in the Netherlands, there might be another explanation. The hormone status was unknown for over 20% of the Dutch patients, presumably because no tests were performed. Consequently, these patients were not eligible for hormone therapy. Even so, only two-thirds of the Dutch patients with known hormone status received hormone therapy, compared to around 90% in the German cohort.
Regardless of such differences, the LRR-rate was comparable between both countries and it seems justified to use the German cohort for external validation. Even if therapy-allocation differs to a certain degree between the cohorts, the same surgical techniques, drugs for hormonal- and chemotherapy and radiation-schemes are used (NABON 2012; Leitlinienprogramm Onkologie (Deutsche Krebsgesellschaft, Deutsche Krebshilfe, AWMF) 2019). With a total of 184 recurrence events, the German cohort also meets an important formal requirement for an external validation: according to Vergouwe et al., at least 100 events and 100 “nonevents” are necessary to determine whether a prediction tool performs well or not (Vergouwe et al. 2005). The LRR-rate of 2.8% in the German validation-cohort is slightly, but not significantly (p = 0.205), above the level in the Dutch modeling-cohort (2.6%). Recently, van Maaren et al. published a paper reviewing long-term recurrence rates for breast cancer based on comprehensive NCR data from 2005, showing that the hazard of LRR-events in Her2neu-positive and triple-negative patients peaks within the second post-surgical year and drops thereafter (Maaren et al. 2018). No clear trends were seen in Luminal A or B patients. The findings concerning the three latter groups could be confirmed within the German validation-cohort. No clear trend was seen in the Her2neu-positive patients. One reason for that might be the small number of patients within this group, which is also reflected by large confidence intervals: one recurrence event more or less can already change the situation considerably. Another possible reason for these differing observations is new developments in therapy. After the introduction of antibody therapy around 2005, Her2neu-positive patients were increasingly treated with Trastuzumab, which positively influences the outcome. Some of the patients in the German validation-cohort received this kind of therapy, while others did not. Obviously, no clear trends for this subgroup can be deduced from analyzing such a heterogeneous sub-population.
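The effect of group size on the width of a confidence interval can be illustrated with a minimal sketch. It uses the Wilson score interval, which is our choice for illustration and not necessarily the method used in the study. The whole-cohort figures (184 events in 6520 patients) are taken from the text; the subgroup of 200 patients with 5 or 6 events is purely hypothetical.

```python
import math

def wilson_interval(events, n, z=1.96):
    """Wilson score interval for a proportion (z=1.96 gives roughly 95% coverage)."""
    p = events / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half, center + half

# Whole validation cohort, figures from the text: 184 LRR events in 6520 patients.
cohort_low, cohort_high = wilson_interval(184, 6520)

# Hypothetical small subgroup (NOT study data): 5 versus 6 events in 200 patients.
small_a = wilson_interval(5, 200)
small_b = wilson_interval(6, 200)

print(f"cohort rate {184 / 6520:.2%}, interval {cohort_low:.2%} to {cohort_high:.2%}")
print(f"subgroup, 5 events: {small_a[0]:.2%} to {small_a[1]:.2%}")
print(f"subgroup, 6 events: {small_b[0]:.2%} to {small_b[1]:.2%}")
```

In such a sketch, a single additional event shifts the interval of the small subgroup noticeably, whereas the interval for the whole cohort stays narrow.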
The INFLUENCE-tool’s accuracy in the validation-cohort was poor according to the Hosmer–Lemeshow test, a finding that should not be overrated. Admittedly, the p value is considerably lower than 0.05, which is commonly regarded as a reasonable threshold between good and poor accuracy. The discrepancy between predicted and observed values may partly be attributed to the large confidence intervals caused by the relatively small number of events in the German validation-cohort. However, even if this aspect is taken into account, observed and predicted values do not differ by mere coincidence: the INFLUENCE-algorithm systematically underestimates the actual risk in each of the risk-stratified quintiles. One reason might be that the LRR-rate in the German cohort is slightly, but not significantly, higher than in the Dutch modeling-cohort, while generally more adjuvant radio-, chemo-, and endocrine therapies (which the INFLUENCE-nomogram associates with a lower LRR-risk) are performed. This could possibly reflect moderate differences in therapy perception between the two populations, which could be an interesting topic for further investigation.
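The comparison of observed and predicted events per risk group that underlies the Hosmer–Lemeshow test can be sketched as follows. This is a minimal illustration, not the study's analysis code: the function name `hosmer_lemeshow` and the toy data are assumptions, and the sketch only computes the test statistic (a p value would then be read from a chi-square distribution). It assumes at least as many patients as groups and predicted risks strictly between 0 and 1.

```python
def hosmer_lemeshow(predicted, observed, groups=5):
    """Return the Hosmer-Lemeshow statistic and (size, events, expected) per risk group."""
    # Order patients by predicted risk (stable sort keeps ties in input order).
    pairs = sorted(zip(predicted, observed), key=lambda pair: pair[0])
    n = len(pairs)
    statistic = 0.0
    rows = []
    for g in range(groups):
        chunk = pairs[g * n // groups:(g + 1) * n // groups]
        size = len(chunk)
        expected = sum(p for p, _ in chunk)   # predicted number of events
        events = sum(y for _, y in chunk)     # observed number of events
        mean_risk = expected / size
        statistic += (events - expected) ** 2 / (size * mean_risk * (1 - mean_risk))
        rows.append((size, events, expected))
    return statistic, rows

# Hypothetical toy data (NOT study data): five groups of 100 patients in which
# the observed event count exceeds the predicted count in every group.
predicted, observed = [], []
for risk, n_events in [(0.01, 2), (0.02, 3), (0.03, 4), (0.04, 5), (0.05, 6)]:
    predicted += [risk] * 100
    observed += [1] * n_events + [0] * (100 - n_events)

statistic, rows = hosmer_lemeshow(predicted, observed)
for size, events, expected in rows:
    print(f"n={size}  observed={events}  expected={expected:.1f}")
```

Systematic underestimation shows up in such a table as observed counts above expected counts in every group, which is the pattern described for the risk-stratified quintiles.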
In any case, accuracy is less important than discriminative ability for clinical use. Health professionals seek to know whether their patients require intensified follow-up, because early detection of recurrence events is associated with superior outcomes (Lu et al. 2009; Sangen et al. 2013; Schneble et al. 2014). On the other hand, it is desirable to spare low-risk patients the psychological burden, and the health care system the financial burden, of overly intensive follow-up schemes (Puglisi et al. 2014). To develop personalized follow-up pathways, physicians will most probably use the INFLUENCE-nomogram together with some kind of cutoff. The ROC curve depicts sensitivity and specificity for every possible threshold that can be used with the INFLUENCE tool. The C-statistic/AUC, the area under this curve, therefore summarizes the discriminative ability of the algorithm across all thresholds. For the 5-year overall LRR-risk algorithm, the C-statistic/AUC was 0.71 in the Dutch modeling-cohort; almost the same value was obtained by the first external validation with another Dutch cohort from 2007 and 2008. With the German patients analyzed within this study, the C-statistic/AUC was even slightly larger (0.73), which indicates good external validity. In practical terms, if the statistically ideal threshold of 1.6% was chosen, more than 70% of the high-risk and more than 65% of the low-risk patients would be classified correctly, which, if implemented in daily clinical practice, would be an important step towards personalized medicine. The prediction tool also turned out to be robust against differences in population features, as no decline in model performance could be seen in any of the subgroup analyses stratified by age, type of surgery, and intrinsic biological subtype, except for Her2neu-positive patients. While this may also be attributed to the issues with this special subgroup discussed earlier, re-evaluating Her2neu as an independent predictor in the INFLUENCE-model should be considered. According to Witteveen et al., including Her2neu did not improve the performance of the INFLUENCE-nomogram, and it was consequently omitted. However, the algorithm is based on patients from 2003 to 2006, which, as previously mentioned, was a period of change as far as Her2neu is concerned, and nowadays Her2neu is believed to have considerable influence on the outcome of interest (Gamucci et al. 2013; McGuire et al. 2017).
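The relation between the C-statistic/AUC and a cutoff-based classification can be made concrete with a minimal sketch. The function names and the four toy patients are assumptions for illustration only; the cutoff of 1.6% is the threshold mentioned in the text. The C-statistic is computed here as the proportion of patient pairs (one with, one without recurrence) in which the patient with recurrence received the higher predicted risk, with ties counted as one half.

```python
def c_statistic(risks, events):
    """Concordance probability: cases ranked above controls, ties count 0.5."""
    cases = [r for r, e in zip(risks, events) if e]
    controls = [r for r, e in zip(risks, events) if not e]
    concordant = 0.0
    for case_risk in cases:
        for control_risk in controls:
            if case_risk > control_risk:
                concordant += 1.0
            elif case_risk == control_risk:
                concordant += 0.5
    return concordant / (len(cases) * len(controls))

def sensitivity_specificity(risks, events, cutoff):
    """Classify patients with predicted risk >= cutoff as high risk."""
    flagged_cases = sum(1 for r, e in zip(risks, events) if e and r >= cutoff)
    cleared_controls = sum(1 for r, e in zip(risks, events) if not e and r < cutoff)
    n_cases = sum(1 for e in events if e)
    n_controls = len(events) - n_cases
    return flagged_cases / n_cases, cleared_controls / n_controls

# Hypothetical toy data (NOT study data): predicted 5-year risks and outcomes.
risks = [0.01, 0.03, 0.02, 0.04]
events = [0, 0, 1, 1]

auc = c_statistic(risks, events)
sens, spec = sensitivity_specificity(risks, events, cutoff=0.016)
print(f"C-statistic {auc:.2f}, sensitivity {sens:.2f}, specificity {spec:.2f}")
```

The C-statistic does not depend on any particular cutoff, whereas sensitivity and specificity describe one point on the ROC curve; choosing a different cutoff trades one against the other.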
In the time-dependent models, discrimination-ability declines over time. The C-statistic/AUC decreases mildly from year 1 to 4; in year 5 it suddenly drops to 0.50, indicating that there is no discriminative ability left. Notably, this is not a random phenomenon observed only in the validation-cohort. Internal validation based on the modeling-cohort returned a C-statistic/AUC of 0.84 for the first year, which declined steadily to 0.62 in year five. While it is not surprising that the model performs better in the modeling-cohort than in the validation-cohort, the INFLUENCE-nomogram evidently has some difficulties in predicting late recurrence events. This issue could perhaps be solved by updating the INFLUENCE-tool on a more recent modeling-cohort and re-evaluating the set of influence variables, as proposed above. It has to be acknowledged, though, that the occurrence of LRR could be influenced by unknown confounders, which might impede substantial improvement of model performance (Meads et al. 2012).