Introduction

Breast cancer is the most common cancer among women and one of the leading causes of death [1]. To support the delivery of the highest quality of care provided by European hospitals to women with breast cancer, the European Society of Breast Cancer Specialists (EUSOMA) defined thirty-four quality indicators (QIs) covering several aspects of the cancer care process from diagnosis to surgical and oncological treatment and follow-up [2]. The QIs can act as tools for hospitals to standardize and optimize their quality of care and enable benchmarking between hospitals within and between countries by setting minimal standards and targets. Benchmarking between countries is advised to learn from each other and further improve the quality of care [3], but implementation is facing challenges.

First, calculating the QIs requires registration of all necessary data items in a structured database with clear definitions and coding rules. This often implies an increase in the registration burden. This challenge might be overcome by using structured data that are already available, for example, data gathered by cancer registries. Moreover, standardized synoptic reporting of imaging, pathology, and treatment may improve data comparability and completeness.

Second, cross-country comparisons are challenging since differences in QI outcomes might be influenced by other underlying characteristics, such as the composition of the population (e.g., a higher number of elderly patients) or screening protocols. Statistical methods, such as propensity score analytics, might limit possible confounding by indication [4].

Third, calculating a QI requires patient-level data per country. Sharing these privacy-sensitive patient data might interfere with compliance with the General Data Protection Regulation (GDPR), which introduced restrictions on data sharing to safeguard privacy [5]. To overcome this problem and make sharing of patient-level data unnecessary, the Netherlands Comprehensive Cancer Organization (IKNL) has developed an open-source federated learning infrastructure: Vantage6. Within this infrastructure, statistical models and their parameters are shared instead of privacy-sensitive patient-level data [6].

In this study, we performed a benchmark on the quality of breast cancer care between the Netherlands and Norway. For both countries, high quality in breast cancer care has already been demonstrated [7,8,9], but differences have been described in, for example, population and hospital density, travel distance to a hospital, and guideline recommendations. These factors might result in variation in QI outcomes for breast cancer. To address the above-mentioned challenges, we used data from the national cancer registries and a novel technology (‘Vantage6’) that enables federated propensity score analytics without physical transfer of patient-level data.

Methods

Quality indicators

EUSOMA QIs were selected for assessment based on data availability, relevance, and clinical importance. This resulted in five selected indicators: two on MRI availability, two on appropriate surgical approach, and one on postoperative radiotherapy and local control (Table 1).

Table 1 The selected EUSOMA quality indicators (QI)

Patients

Data of all female invasive breast cancer patients diagnosed in 2017 and 2018, fitting the QI inclusion requirements (Table 1), were selected from the nationwide Netherlands Cancer Registry (NCR) [10] and the nationwide Cancer Registry of Norway (CRN) [11]. The NCR is hosted by IKNL, which has data managers in all hospitals collecting data directly from the patient files based on a notification by the Automated Pathology Archive (PALGA). CRN receives pathology data of all cancer cases in a copy of pathology reports sent to the clinicians. Norwegian clinical departments register in CRN’s electronic reporting service (KREMT). Reports are sent at different time points in the care pathway: at the time of diagnosis, each surgical event, primary adjuvant treatment, the start of hormone therapy, and the end of hormone therapy [8, 9]. A separate notification is submitted for every event during diagnosis, treatment, and follow-up.

Statistical analysis

To limit confounding by indication related to patient and tumor characteristics, Propensity Score Stratification (PSS) was used to balance characteristics between populations of the two countries. The propensity score is defined as the probability of being treated based on individual characteristics (covariate values) [4, 12, 13]. The patients were divided into strata that had similar propensity scores, with the objective to balance the observed covariate values between the two populations within each stratum [4, 14]. All QI outcomes per stratum were averaged to calculate the final percentage and 95% confidence intervals.
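As an illustration of the stratification step described above, the following minimal Python sketch assigns patients to strata by rank of their propensity score and averages the per-stratum QI proportion per country. The function names, the equal-sized (quantile) strata, the unweighted averaging over strata, and the toy values are assumptions made for illustration only; the study's exact stratification rules and its confidence interval calculation are not reproduced here.

```python
from statistics import mean

def assign_strata(scores, n_strata=5):
    """Assign each patient to one of n_strata equal-sized groups by score rank."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    strata = [0] * len(scores)
    for rank, i in enumerate(order):
        strata[i] = rank * n_strata // len(scores)
    return strata

def stratified_qi(scores, country, outcome, n_strata=5):
    """Average the per-stratum QI proportion for each country (unweighted)."""
    strata = assign_strata(scores, n_strata)
    result = {}
    for c in sorted(set(country)):
        per_stratum = []
        for s in range(n_strata):
            hits = [outcome[i] for i in range(len(scores))
                    if strata[i] == s and country[i] == c]
            if hits:  # skip strata in which this country has no patients
                per_stratum.append(mean(hits))
        result[c] = mean(per_stratum)
    return result

# Toy data (hypothetical): propensity score, country, and binary QI outcome.
scores = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8]
country = ["NL", "NO", "NL", "NO", "NL", "NO", "NL", "NO"]
outcome = [1, 0, 1, 0, 1, 1, 0, 1]
print(stratified_qi(scores, country, outcome, n_strata=2))
```

Within each stratum, patients from both countries have similar propensity scores, so the per-stratum proportions compare like with like before they are averaged.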

One of the challenges of a propensity score calculation between countries is that the potential confounders (independent variables) may differ in the way they are registered or defined. Table 2 describes, per cancer registry, the definitions of the variables that were used as independent variables in the calculation of the propensity score.

Table 2 Definitions of independent variables per cancer registry

The data balance was calculated before and after PSS using a Standardized Mean Difference (SMD) on every independent variable for each QI. An absolute SMD > 0.1 indicates an imbalance in the characteristics between the two countries for a QI. The SMD is applicable to all variables because it is independent of the unit of measurement [4].
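The balance check can be sketched as follows. This minimal Python example uses the common definition of the SMD (difference in means divided by the pooled standard deviation); the function names and the example ages are hypothetical, and the exact SMD variant used in the study (for instance for categorical variables or within strata) is not stated in the text.

```python
from math import sqrt
from statistics import mean, variance

def smd(group_a, group_b):
    """Standardized mean difference: mean difference over pooled SD."""
    pooled_sd = sqrt((variance(group_a) + variance(group_b)) / 2)
    return (mean(group_a) - mean(group_b)) / pooled_sd

def is_imbalanced(group_a, group_b, threshold=0.1):
    """Flag a covariate whose absolute SMD exceeds the threshold."""
    return abs(smd(group_a, group_b)) > threshold

# Hypothetical ages in two small groups.
ages_nl = [60, 62, 64]
ages_no = [58, 60, 62]
print(smd(ages_nl, ages_no), is_imbalanced(ages_nl, ages_no))
```

Because the difference is divided by a standard deviation, the result is unitless, which is why a single threshold can be applied to every covariate. The sketch raises a `ZeroDivisionError` when both groups have zero variance.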

A generalized linear model in the form of a logistic regression was used to calculate the propensity score. The propensity scores were then divided into strata, using the number of strata that yielded the lowest SMD.

The QI outcomes were compared to the minimal standards and targets set by EUSOMA. Additionally, an odds ratio (OR) of the outcome of each individual QI was calculated to quantify the difference in likelihood between the two countries.
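For readers less familiar with odds ratios, the following minimal Python sketch computes an OR with a Wald-type 95% confidence interval from a 2×2 table. The counts are hypothetical, and the Wald interval is an assumption for illustration; the text does not state how the ORs and their intervals were derived from the stratified data.

```python
from math import exp, log, sqrt

def odds_ratio(a, b, c, d, z=1.96):
    """OR and Wald CI for a 2x2 table.

    a, b: patients meeting / not meeting the QI in country 1
    c, d: patients meeting / not meeting the QI in country 2
    """
    or_value = (a * d) / (b * c)
    se_log_or = sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    lower = exp(log(or_value) - z * se_log_or)
    upper = exp(log(or_value) + z * se_log_or)
    return or_value, lower, upper

# Hypothetical counts: 20 of 100 versus 10 of 100 patients meet the QI.
print(odds_ratio(20, 80, 10, 90))
```

An OR above 1 with a confidence interval that excludes 1 indicates that patients in country 1 were more likely to meet the QI than patients in country 2.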

Federated analytics (Vantage6)

The PSS was applied on each QI within a federated learning infrastructure (Vantage6; [15]). Using a federated implementation of the Generalized Linear Model (GLM, see Online Appendix A), Vantage6 made it possible to compute the propensity scores while patient-level data remained at their respective locations. After the propensity score of each observation had been computed, these propensity scores were sent to the investigator. These scores are void of identifiable information, as they represent a predicted outcome (i.e., a score between 0 and 1). With this method of reducing confounding by indication, PSS can also be applied when privacy-sensitive data may not leave the organization.
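The principle of sharing model parameters rather than patient records can be illustrated with a small, self-contained Python sketch. It is not the Vantage6 API and not the federated GLM of Online Appendix A: as a simplifying assumption it uses plain gradient descent, in which each registry returns only a summed gradient and a row count, and a coordinating server updates the coefficients. All function names and the toy covariate values are hypothetical.

```python
from math import exp

def sigmoid(t):
    return 1.0 / (1.0 + exp(-t))

def local_gradient(rows, labels, beta):
    """Runs inside one registry: returns only aggregate values."""
    grad = [0.0] * len(beta)
    for x, y in zip(rows, labels):
        p = sigmoid(sum(b * xi for b, xi in zip(beta, x)))
        for j, xi in enumerate(x):
            grad[j] += (y - p) * xi
    return grad, len(rows)

def federated_fit(nodes, n_features, lr=0.5, n_iter=2000):
    """Server loop: combine node gradients; patient rows never leave a node."""
    beta = [0.0] * n_features
    for _ in range(n_iter):
        replies = [local_gradient(rows, labels, beta) for rows, labels in nodes]
        total_n = sum(n for _, n in replies)
        for j in range(n_features):
            beta[j] += lr * sum(g[j] for g, _ in replies) / total_n
    return beta

def propensity(rows, beta):
    """Predicted probability of belonging to the country labeled 1."""
    return [sigmoid(sum(b * xi for b, xi in zip(beta, x))) for x in rows]

# Toy data: intercept plus one standardized covariate per patient.
nl_rows = [[1.0, 0.5], [1.0, 1.2], [1.0, -0.3], [1.0, 0.8]]
no_rows = [[1.0, -0.5], [1.0, 0.2], [1.0, -1.0], [1.0, 0.9]]
nodes = [(nl_rows, [1] * 4), (no_rows, [0] * 4)]  # label = country indicator
beta = federated_fit(nodes, n_features=2)
print(beta, propensity(nl_rows, beta))
```

Since the summed gradient over both nodes equals the gradient over the pooled data, fitting with a single node that holds all rows gives the same coefficients up to floating-point rounding.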

As a validation, the analysis was also performed in a non-federated manner (pooling the data) with the R package MatchIt.

Results

A total of 32,786 and 6377 patients were diagnosed in 2017 and 2018 in the Netherlands and in Norway, respectively (Table 3). Mean age was 62.4 years [standard deviation (SD) ± 13.8] in the Netherlands and 60.9 years (SD ± 12.9) in Norway.

Table 3 Descriptive analysis of Norwegian and Dutch invasive breast cancer patients diagnosed and treated between 2017 and 2018

The calculated results for each QI are presented in Fig. 1, both before and after PSS. The observed imbalance decreased after PSS for most covariates. The calculated QIs were rather similar before and after PSS; therefore, only the QIs after PSS are described. In addition, the minimum standard set by EUSOMA is marked in Fig. 1.

Fig. 1 Results EUSOMA QI before (crude QI, 1a) and after PSS (1b) presented as average mean with minimum standard and target marked

Pre-operative MRI

For the calculation of the QI ‘Pre-operative MRI,’ 21,664 patients from the Netherlands and 5262 from Norway were included (Online Appendix B). Patients with unknown tumor size (and therefore pTx) were omitted due to low occurrence and interference with PSS. Age, differentiation grade, pathological node stage (pN), and HER2 status had an SMD outside the threshold of −0.1/0.1 before PSS, which indicates an imbalance between the two countries. After applying a five-strata PSS, the SMDs of these imbalanced variables were reduced and moved below the threshold. The proportion of patients preoperatively examined by MRI in the Netherlands was 37.0% (95% CI 34.1–40.0) and in Norway 17.5% (95% CI 15.3–19.7), both above the EUSOMA minimum standard of 10%. Patients in the Netherlands were more likely to be examined preoperatively by breast MRI [OR 2.8 (95% CI 2.7–2.9)] compared to patients in Norway.

Application of MRI

The analysis of this QI included 7003 patients from the Netherlands and 752 from Norway (Online Appendix C). Age, histological tumor type, differentiation grade, estrogen receptor (ER) status, and progesterone receptor (PR) status had an SMD higher than the threshold. A five-strata PSS resulted in an acceptable balance, with only “year at diagnosis” remaining over the threshold. However, the strata were not evenly distributed for patients in Norway, with only 29 patients in stratum 5. Nonetheless, this did not affect the average results of the QI. The proportion of patients treated with primary systemic treatment (PST) undergoing breast MRI (before, during, and/or after) in the Netherlands was 83.3% (95% CI 79.1–87.5) and in Norway 70.8% (95% CI 66.4–75.2), both above the EUSOMA minimum standard of 60%. The EUSOMA target of 90% was not achieved by either country. Patients in the Netherlands were significantly more likely to receive an MRI before, during, and/or after PST [OR 2.3 (95% CI 1.3–3.3)] compared to patients in Norway.

Single breast operation

The first QI on surgical approach, ‘Single breast operation,’ included 28,806 patients from the Netherlands and 5029 patients from Norway (Online Appendix D). Differentiation grade, pT, and pN were imbalanced before the PSS. After applying a five-strata PSS, only one pT category was still imbalanced, with an SMD of 0.101. Adjusting the number of strata did not further improve balance. The proportion of patients who received a single breast operation for the primary tumor in the Netherlands was 95.2% (95% CI 94.5–95.9) and in Norway 91.5% (95% CI 89.1–93.9), both of which were above the minimum standard (80%) and the target (90%) set by EUSOMA. Patients in the Netherlands were more likely to receive a single breast operation [OR 1.8 (95% CI 1.4–2.2)] compared to patients in Norway.

Immediate breast reconstruction (IBR)

For the QI ‘immediate breast reconstruction (IBR),’ 7116 patients from the Netherlands and 748 from Norway were included (Online Appendix E). Differentiation grade, pT, pN, and PR status were imbalanced with an SMD higher than the threshold. The five-strata PSS did not improve the balance of the data. The proportion of patients receiving IBR in the Netherlands was 36.0% (95% CI 31.3–40.7) and in Norway 37.4% (95% CI 29.8–44.9). Both countries scored slightly below the EUSOMA minimum standard of 40%, and the likelihood of receiving an IBR was similar for both countries [OR 1.2 (95% CI 0.7–1.7)].

Postoperative radiation therapy

For the analysis of the QI ‘Postoperative radiation therapy,’ 17,594 patients from the Netherlands and 3748 patients from Norway were included (Online Appendix F). Differentiation grade and pT were imbalanced before the PSS. This QI required a nine-strata PSS to achieve a good balance, which resulted in none of the variables having an SMD higher than the threshold. The proportion of patients who received postoperative radiation therapy after surgical resection of the primary tumor and axillary staging/surgery in the framework of breast-conserving therapy in the Netherlands was 94.9% (95% CI 91.8–98.0) and in Norway 95.7% (95% CI 94.6–96.7). Both countries scored above the EUSOMA minimum standard of 90%. Norway reached the target of 95% and the Netherlands almost achieved this. Postoperative radiation therapy was applied to the same extent in both countries [OR 1.1 (95% CI 0.8–1.5)].

The sensitivity analysis performed in a non-federated manner (pooling the data) yielded comparable results (Table 4).

Table 4 Results of the sensitivity analysis for the PSS: a comparison of the non-federated and the federated learning infrastructure

Discussion

In this study, we compared QIs for patients diagnosed with breast cancer in the Netherlands and Norway. The challenges in benchmarking between countries were addressed for five QIs using cancer registry data and propensity score analytics applied with a federated approach using Vantage6. This approach resulted in QI outcomes that were comparable to those of a traditional non-federated analysis, in line with other studies [16]. In addition, our study showed that PSS can be executed in a federated manner.

Explanation of differences found

Some differences in the QI outcomes between the two countries were revealed. Although the norm was reached in both the Netherlands and Norway, patients in the Netherlands were significantly more likely to be examined preoperatively by breast MRI (QI pre-operative MRI), to receive an MRI examination before, during, and/or after PST (QI application of MRI), and to receive a single breast operation compared to patients in Norway. The target set by EUSOMA for application of MRI (90%) was not achieved by either of the countries. The likelihood of receiving postoperative radiation therapy or IBR was comparable for both countries, but for the latter QI, the minimum norm of 40% was not achieved in either country.

Differences in the indicators between the two countries could be influenced by many factors.

The first factor could be related to differences in guidelines and in their implementation. In both the Norwegian and Dutch guidelines, the use of breast MRI is only recommended for selected patient groups [17, 18]. The significant differences in results could also be explained by differences in implementation of various recommendations in the breast cancer guidelines. In 2010, the Netherlands introduced new indications for pre-operative MRI in the breast cancer guideline [17]. It states that patients with lobular invasive breast cancer should be preoperatively staged using breast MRI, as this reduces the percentage of reoperation and mastectomy [17,18,19]. The same indication was introduced in the Norwegian guidelines in 2017 [20]. Moreover, the Norwegian guidelines indicated pre-operative MRI in case of a discrepancy in tumor size between ultrasound, mammography, and clinical examination where this has an impact on treatment, in cases where multifocality is believed to be difficult to exclude, in T2 tumors planned for neoadjuvant chemotherapy, and in patients with a known hereditary risk of breast cancer with genetic defects.

Since the data included in the current study are from 2017 to 2018, it could be that the new guidelines were not (yet) fully implemented in daily practice. It is noteworthy that there was an increase in the proportion of patients receiving a pre-operative MRI in Norway from 16.7% in 2017 to 19.3% in 2018. The motivation for undergoing MRI with PST, as defined by EUSOMA, is to be able to evaluate the response to PST [2]. In the Netherlands, this was introduced in the breast cancer guidelines in 2012 [17]. Norway introduced this recommendation in 2007 [21]. The percentage of patients treated with PST undergoing breast MRI has steadily increased in recent years in both the Netherlands [7, 19] and Norway [22].

The recommendation to perform IBR, whenever feasible, was part of the first breast cancer guideline of the Netherlands in 2002 [23], was added to the indicator set of the NBCA in 2012, and was part of the reconstruction guideline [24]. Consequently, an increase in the use of IBR has since been observed in the Netherlands [25]. The breast cancer guideline of Norway introduced the possibility of IBR as an alternative to simple mastectomy in 2007 [21]. In practice, IBR was not performed routinely until after 2013, when it was stated that all patients undergoing mastectomy should be offered the possibility of IBR [25]. In 2016, the percentage of patients receiving IBR in Norway was 27% [26], which increased to 48.5% of patients < 70 years in 2020 [27]. Still, it remains debatable whether a norm on IBR could be set, since patient preferences are important factors of influence.

Postoperative radiation therapy has been recommended in both countries after breast-conserving surgery, in case of large primary tumors and locally advanced stage, and after non-radical surgical intervention in case of positive lymph nodes [21, 23].

Second, the indicators in our study had different levels of evidence, ranging from I to IV, and were not all mandatory (Table 1). When the level of evidence is high and the indicator is mandatory, which is the case for the QI ‘Postoperative radiation therapy,’ we found a high concordance between the countries and scores above the norm, reaching or approaching the target. For the QI on immediate reconstruction, the level of evidence is III and the indicator is recommended, which can explain why the norm was not reached and why differences between the countries exist.

Third, different definitions of inadequate tumor margins used by clinicians might have influenced the QI outcomes on reoperation. In the Netherlands, practice differs from that in other countries: re-excisions are omitted in case of focally positive margins after breast-conserving surgery, without impairing disease-free and overall survival [28].

Fourth, another important factor influencing the outcomes could be the data used for this study, which were obtained through national cancer registries. The way of gathering the data differed between the registries. The CRN is dependent on automated data transfers from hospitals and pathology laboratories [11]. The NCR gathers data directly from the patient files through trained data managers, but has limited access to imaging procedures performed for surveillance purposes, e.g., in case of a high risk of breast cancer. For the QIs on MRI, this might have influenced the results since the NCR might not have been able to include MRI examinations performed for surveillance purposes. Moreover, the reason to perform an MRI was not noted in either registry. This could have been pre-operative staging (possibly applicable to most of the cases), but inconclusive findings, lobular cancers, or an MRI in a high-risk screening setting could also have been among the reasons. Furthermore, after a patient receives PST in Norway, the pathology TNM classifications are not registered in the pathology report but as a separate new variable, which was not available for the analysis in this study. This caused problems with the analysis, and therefore the pathology TNM classifications were removed from the analysis. Due to this obstacle, the PSS was less comprehensive. Finally, the registration of a positive ER status is slightly different between the Netherlands and Norway. An ER level greater than or equal to 10% is defined as positive in the Netherlands, whereas in Norway an ER level greater than 1% is already defined as positive.

Use of PSS

Using PSS, it was possible to increase the balance in each subpopulation of the specific QIs. In every subpopulation, the differentiation grade and TNM classification variables were unbalanced, based on the SMDs. The PSS reduced the SMDs of most of the variables. However, following PSS the QI results changed only slightly, except for the QI outcomes for IBR and MRI availability in Norway, which increased and decreased by almost 4%, respectively. The differences in results after PSS in the Dutch subpopulations were small, with changes of only 0.5%.

The results of the federated and non-federated analyses were almost identical. This is in concordance with a study by Cellamare et al., who proved that the federated GLM used to implement the federated logistic regression is mathematically identical to the centralized variant [29]. It has to be noted that algorithms to compute regressions are never exact: they provide estimates, and differences may occur depending on the chosen hyperparameter settings. A GitHub repository [30] has been set up to provide source code for those interested in empirically comparing results from centralized and federated linear models.

Future perspectives

In an era of increasing emphasis on data protection and privacy, benchmarking between hospitals might be more difficult if data pooling is required for the calculation of QIs. The results of our study show that benchmarking using federated PSS is possible, which might improve the possibilities to compare benchmarks between countries. In further studies, additional EUSOMA QIs and data from more recent years should provide a more comprehensive view of the quality of breast cancer care. This could identify more areas for improvement and open further discussions, ultimately improving the quality of care for breast cancer patients. Cancer registries should gather data on treatment and outcomes in a standardized manner following the FAIR principles (Findable, Accessible, Interoperable, and Reusable), making federated learning easier. Automatic digital data extraction from, for example, cancer registries could decrease the registration burden.

Conclusion

In conclusion, propensity score stratification using federated analytics was successful in comparing QIs between two countries, opening future possibilities for comparing QIs without transfer of privacy-sensitive data while adhering to the highest standards of data governance and patient privacy.