Clinical evaluation of deep learning-based risk profiling in breast cancer histopathology and comparison to an established multigene assay

Purpose
To evaluate the Stratipath Breast tool for image-based risk profiling and compare it with an established prognostic multigene assay for risk profiling in a real-world case series of estrogen receptor (ER)-positive and human epidermal growth factor receptor 2 (HER2)-negative early breast cancer patients categorized as intermediate risk based on classic clinicopathological variables and eligible for chemotherapy.

Methods
In a case series comprising 234 invasive ER-positive/HER2-negative tumors, clinicopathological data including Prosigna results and corresponding HE-stained tissue slides were retrieved. The digitized HE slides were analysed by Stratipath Breast.

Results
The Stratipath Breast analysis identified 49.6% of the clinically intermediate tumors as low risk and 50.4% as high risk. The Prosigna assay classified 32.5%, 47.0% and 20.5% of tumors as low, intermediate and high risk, respectively. Among Prosigna intermediate-risk tumors, 47.3% were stratified as Stratipath low risk and 52.7% as high risk. In addition, 89.7% of Stratipath low-risk cases were classified as Prosigna low/intermediate risk. The overall agreement between the two tests for low-risk and high-risk groups (N = 124) was 71.0%, with a Cohen's kappa of 0.42. For both risk profiling tests, grade and Ki67 differed significantly between risk groups.

Conclusion
The results from this clinical evaluation of image-based risk stratification show considerable agreement with an established gene expression assay in routine breast pathology.

Supplementary Information
The online version contains supplementary material available at 10.1007/s10549-024-07303-z.


Introduction
Based on classic clinicopathological variables, a significant proportion of estrogen receptor (ER)-positive and human epidermal growth factor receptor 2 (HER2)-negative early-stage breast cancers are categorized as clinically intermediate risk, a category that provides limited information to guide adjuvant chemotherapy decisions. Prognostic risk profiling has become an integrated part of modern breast cancer diagnostics, providing additional risk information for this patient group and identifying patients in whom adjuvant chemotherapy can be omitted [1-3].
Among the established prognostic multiparameter diagnostic assays based on gene expression [4], the Prosigna assay (Prosigna Breast Cancer Prognostic Gene Signature Assay, Veracyte, South San Francisco, USA) is widely used and endorsed by national and international guidelines [5-8]. The Prosigna assay identifies intrinsic molecular subtypes (i.e., luminal A, luminal B, HER2-enriched and basal-like) and provides an individual risk of recurrence (ROR) score between 0 and 100 along with a three-tier risk category (low, intermediate, high) based on ROR score and nodal status. The Prosigna assay contributes prognostic information for patients with early ER+/HER2- breast cancer and its efficacy has been demonstrated in several study populations [1, 9-14].
The diagnostic foundation, with pathological assessment of well-established prognostic variables such as tumor size, stage and tumor grade, is still an essential part of clinical decision making [15]. Among these, histologic grade is one of the most important prognostic factors for breast cancer [16, 17]. Approximately 50% of all [17-20], and around 60% of ER-positive/HER2-negative [1, 21], breast cancers are classified as histologic grade 2, which is a heterogeneous group of tumors varying in aggressiveness and prognosis [22, 23] and is therefore of limited value in guiding decisions on choice of therapy. A limitation in clinical decision-making is that, despite the use of prognostic multigene assays, tests such as Prosigna may classify up to 44% of histologic grade 2 tumors as intermediate risk [24], which does not add any clinically actionable information. In addition, the diagnostic multigene assays for breast cancer risk stratification show discordances in risk categorization between different tests [12, 25].
Digital pathology workflows are becoming standard practice and enable application of advanced image analysis in the clinical setting [26]. In addition, the recent evolution in deep learning, a field of artificial intelligence (AI), has further expanded the utility of machine learning techniques in computational pathology, making it possible to predict patient prognosis [27-32], response to neoadjuvant therapy [33], underlying molecular phenotypes [28, 34-37] or multigene assay results [38] in breast cancer using computer-based models to analyze and characterize histopathology whole slide images. Hence, computational pathology also plays a central role in precision medicine [26]. By leveraging grade-associated morphological features from hematoxylin and eosin (HE)-stained histopathology slides, deep learning-based image analysis has been shown to enable stratification of grade 2 tumors into two risk groups associated with risk of recurrence [27].
The novel AI-based precision diagnostic solution, Stratipath Breast (Stratipath AB, Solna, Sweden), is a commercial CE-IVD marked deep learning-based image analysis tool that utilizes digitized histopathological whole slide images to stratify intermediate-risk patients in terms of risk of recurrence [29]. The test outputs a two-tier risk category. Compared with multigene assays, deep learning-based techniques have the strength of providing fast and cost-efficient solutions.
In this study, we provide the first clinical evaluation of the AI-based Stratipath Breast tool for image-based risk profiling, comparing it with an established multigene assay for risk stratification in a real-world breast cancer case series of clinically intermediate-risk ER+/HER2- tumors.

Patient inclusion and clinical data retrieval
This retrospective real-world case series consisted of 234 invasive breast tumors from patients with early ER-positive/HER2-negative breast cancer, clinically assessed as intermediate-risk tumors and eligible for chemotherapy, diagnosed at Karolinska University Hospital and Södersjukhuset, Stockholm, Sweden. The case series represents consecutive tumors that had been analysed with the Prosigna assay in clinical routine at the point of diagnosis between 2020 and 2022, to evaluate the patients' risk of recurrence according to the Swedish national guidelines [39] and regional treatment guidelines. The guidelines recommend that gene expression-based analysis should be considered for postmenopausal patients with lymph node-negative disease or 1-3 positive nodes (N0 or N1) and ER+/HER2- breast cancer where there is uncertainty about the tumor's risk categorization prior to chemotherapy decisions. In addition, multigene assays were considered for luminal B-like tumors, based on immunohistochemistry biomarker categorization, to provide information for chemotherapy decisions. The cohort has partly been expanded from Kjällquist et al. [24]. The Prosigna assay had been performed at the Department of Clinical Pathology and Cancer Diagnostics, Karolinska University Hospital, on sections from formalin-fixed paraffin-embedded (FFPE) breast cancer tissue blocks, according to the manufacturer's instructions (Veracyte, South San Francisco, CA, USA) on the nCounter system (NanoString Technologies, Seattle, WA, USA) as part of clinical routine. Clinicopathological data including Prosigna results (intrinsic molecular subtype, ROR score (0-100) and risk group) were retrospectively retrieved from electronic records, along with the corresponding archived HE-stained, and parallel sectioned Ki67-stained, FFPE tissue slides.

Ki67 global scoring
Due to the change in Ki67 scoring recommendations in 2022, all tumors were re-scored for Ki67 by the global scoring method using the open-source image analysis software QuPath [40]. All original Ki67-stained tissue slides were digitized in-house with a Hamamatsu NanoZoomer XR (Hamamatsu Photonics K.K., Shizuoka, Japan) at 40X magnification (0.226 μm/pixel). A previously described protocol for digital Ki67 global scoring using QuPath was followed [41-43], in accordance with recommendations from the International Ki67 in Breast Cancer Working Group (https://www.ki67inbreastcancerwg.org/; accessed on 18 July 2023). The analysis was run on the entire invasive tumor area of the whole slide image (WSI) and output as a global Ki67 score (%). The few cases without an available digitized Ki67 slide (N = 22) were manually evaluated by the global scoring method [44]. The cut-offs applied in the national guidelines were used to define three-tier groups: Ki67-low (< 6%), Ki67-intermediate (6-29%), and Ki67-high (> 29%) [6].
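As an illustration of the three-tier grouping described above, the following minimal Python sketch maps a global Ki67 score (%) to its category. The function name `ki67_group` is hypothetical and not part of QuPath or of the study's protocol; the thresholds are the ones stated in the text, applied literally (scores strictly above 29% are treated as high).

```python
def ki67_group(score_percent: float) -> str:
    """Map a global Ki67 score (%) to the three-tier group used in the text.

    Cut-offs as stated in the text: low (< 6%), intermediate (6-29%), high (> 29%).
    Hypothetical helper for illustration only.
    """
    if not 0.0 <= score_percent <= 100.0:
        raise ValueError("Ki67 score must be a percentage between 0 and 100")
    if score_percent < 6.0:
        return "Ki67-low"
    if score_percent <= 29.0:
        return "Ki67-intermediate"
    return "Ki67-high"


# Example values: the minimum, median and maximum Ki67 scores reported in the Results.
for score in (3.85, 24.5, 75.2):
    print(score, ki67_group(score))
```

With the reported minimum, median and maximum scores, the sketch assigns the low, intermediate and high groups, respectively.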

Stratipath Breast analysis
Stratipath Breast (Stratipath AB, Solna, Sweden) is a commercial CE-IVD marked deep learning-based image analysis tool for risk stratification of breast cancer patients. HE-stained slides of FFPE tissue sections were retrieved and subsequently digitized in-house with a Hamamatsu NanoZoomer XR at 40X magnification (0.226 μm/pixel). Each HE-stained WSI was analysed by Stratipath Breast (version 1.1). The image analysis model encompasses consecutive steps including quality assessment, cancer detection and risk stratification. Twenty-three images did not meet the intrinsic quality control of the Stratipath Breast analysis and were excluded from subsequent analysis (Supplementary Fig. S1). In addition, for two cases the WSIs were not available for Stratipath Breast analysis. As part of quality control, all Stratipath Breast analysis reports underwent pathologist review to verify that an adequate tumor area had been analysed, and cases that did not meet this requirement were excluded (N = 14; Supplementary Fig. S1), in line with the manufacturer's instructions for use (Stratipath AB). For each case, Stratipath Breast provides a two-tier risk group (low risk or high risk) together with a continuous risk score (research use only).

Statistical analysis
Agreement between the two risk stratification approaches was reported as the number and percentage of cases, and Cohen's kappa was used for two-group comparisons. Differences in the distribution of patients across risk groups with respect to categorical clinical variables were evaluated by Fisher's exact test when the minimum number of patients in a subgroup was less than 5, and by the chi-square test otherwise. For comparing differences in continuous variables that were not assumed to be normally distributed, the Mann-Whitney U test (comparison across two groups) and the Kruskal-Wallis test (comparison across more than two groups) were used. The correlation between continuous scores was calculated with Spearman correlation. All statistical tests were 2-sided, and a P value of less than 0.05 was regarded as significant. The above statistical analyses were performed in IBM SPSS Statistics (version 28.0; IBM, Armonk, New York, USA). Changes in classification between tests were visualized by Sankey diagrams created in https://jsfiddle.net.
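To make the agreement measures concrete, the sketch below computes the observed agreement and Cohen's kappa for a two-by-two table in plain Python. It is an illustrative calculation, not the SPSS procedure used in the study. The function name `cohens_kappa` is hypothetical, and the cell counts are derived from figures reported in the Results: 24 of the 76 Prosigna low-risk cases were Stratipath high risk (leaving 52 concordant low), and 12 of the 48 Prosigna high-risk cases were Stratipath low risk (leaving 36 concordant high).

```python
def cohens_kappa(table):
    """Return (observed agreement, Cohen's kappa) for a square agreement table.

    table[i][j] = number of cases placed in category i by test A (rows)
    and in category j by test B (columns).
    """
    k = len(table)
    n = sum(sum(row) for row in table)
    # Observed agreement: share of cases on the diagonal.
    observed = sum(table[i][i] for i in range(k)) / n
    # Agreement expected by chance, from the row and column totals.
    row_totals = [sum(row) for row in table]
    col_totals = [sum(table[i][j] for i in range(k)) for j in range(k)]
    expected = sum(row_totals[i] * col_totals[i] for i in range(k)) / n**2
    kappa = (observed - expected) / (1 - expected)
    return observed, kappa


# Rows: Prosigna low, Prosigna high. Columns: Stratipath low, Stratipath high.
# Counts derived from the reported results (Prosigna intermediate-risk cases excluded).
table = [
    [52, 24],  # 76 Prosigna low-risk cases
    [12, 36],  # 48 Prosigna high-risk cases
]
observed, kappa = cohens_kappa(table)
print(f"N = {sum(map(sum, table))}, agreement = {observed:.1%}, kappa = {kappa:.2f}")
```

Run on these counts, the sketch reproduces the reported values for the 124 low-risk and high-risk cases: an overall agreement of 71.0% and a kappa of 0.42.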

Patient characteristics
A total of 234 early-stage ER-positive/HER2-negative invasive breast tumors were included in the analyses of this study (Supplementary Fig. S1). The patients' clinical characteristics and associated Prosigna results are summarized in Table 1. Most of the tumors were invasive carcinoma of no special type (NST) or mixed NST (79.5%) and 17.9% were invasive lobular carcinomas (ILC). The median Ki67 score was 24.5% (range 3.85-75.2%) by the digital global scoring method. Out of all included tumors, the Prosigna assay classified 76 (32.5%), 110 (47.0%) and 48 (20.5%) tumors as low, intermediate and high risk, respectively. The median ROR score was 47 with a range from 3 to 84. The Prosigna intrinsic molecular subtypes were distributed as follows: 127 (54.3%) luminal A, 107 (45.7%) luminal B, 0 (0%) HER2-enriched and 0 (0%) basal-like. Notably, one patient had more than three lymph node metastases.

Comparison between the tests for risk stratification
The Stratipath Breast analysis identified 116 (49.6%) tumors as low risk and 118 (50.4%) as high risk. Among Prosigna intermediate-risk tumors, 52 (47.3%) were stratified as low risk and 58 (52.7%) as high risk by Stratipath Breast (Fig. 1A; Table 2). In addition, 24 (31.6%) of the 76 Prosigna low-risk cases were upgraded to high risk by Stratipath Breast, whereas 12 (25.0%) of the 48 Prosigna high-risk cases were downgraded by Stratipath Breast (Fig. 1B; Tables 2 and 3). The overall agreement between the two tests for low-risk and high-risk groups was 71.0%, with a Cohen's kappa of 0.42. Prosigna intermediate-risk results were not included in the overall agreement estimate as they are non-informative for treatment decision making. However, when grouping Prosigna low and intermediate risk together, out of the 116 Stratipath low-risk cases 104 (89.7%) were Prosigna low/intermediate risk and 12 (10.3%) were high risk (Fig. 1C, Supplementary Table S1).
In total, 36 cases showed a two-level discordant risk category (low and high) between the two tests (Supplementary Table S4). There were 24 Stratipath-high, Prosigna-low cases, all of which were grade 2 or 3, node negative and luminal A. Among the 12 Stratipath-low, Prosigna-high cases there was a mix of all grades, all but one case were luminal B, and there was a higher proportion of invasive lobular carcinoma (ILC; 33.3%) and invasive mucinous carcinoma (IMC; 16.7%) than in the opposite two-level discordant group (16.7% ILC and 0% IMC). Upon review, two of the Stratipath-low, Prosigna-high cases had incorrectly reported tumor size, which may have altered the ROR score and Prosigna risk category if re-tested.

Clinicopathological characteristics across Stratipath risk groups
When comparing the distribution of clinical variables in each risk group, there was a significant difference in distribution of grade (p < 0.001), Ki67 status (p = 0.004), histologic subtype (p = 0.002) and intrinsic subtype (p < 0.001) across Stratipath risk groups, but no difference regarding PR status, lymph node status or tumor size (Table 4). The majority of grade 1 (10 of 12) and grade 3 (37 of 46) tumors were stratified as Stratipath low risk and high risk, respectively. There was a significant difference in the distribution of Ki67 score across Stratipath risk groups, with higher Ki67 scores in the high-risk than the low-risk group (Mann-Whitney U test p = 0.001; Fig. 2A). Among grade 2 cases, no difference in Ki67 score was observed between Stratipath risk groups (Mann-Whitney U test p = 0.058; Fig. 2B). For the group of grade 2 tumors, only histologic subtype (p = 0.010) and intrinsic subtype (p < 0.001) differed significantly between the Stratipath risk groups (Supplementary Table S5). ILC accounted for 17.9% of the tumors and 29 of the 42 (69.0%) ILC tumors were classified as low risk by Stratipath Breast, with an even higher proportion among grade 2 cases (22.2% ILC and 74.4% of ILC as low risk). Among Prosigna intermediate-risk cases, a significant difference between Stratipath low-risk and high-risk groups was identified for grade (p = 0.002) and lymph node status (p = 0.013; Supplementary Table S6).

Clinicopathological characteristics across Prosigna risk groups
Across Prosigna risk groups, a significant difference in distribution of grade (p < 0.001), PR status (p = 0.011), Ki67 status (p < 0.001), tumor size (p = 0.001) and intrinsic subtype (p < 0.001) was observed (Table 4), and Ki67 status, tumor size, lymph node status and intrinsic subtype were all significant among grade 2 cases (Supplementary Table S5). There was no difference in the distribution of histologic subtype between Prosigna risk groups.
Regarding Ki67 score, a significant difference in distribution across all Prosigna risk groups was observed (Kruskal-Wallis test adjusted p < 0.001; Fig. 2C). Among grade 2 cases there was a difference in distribution of Ki67 score across the three Prosigna risk groups (Kruskal-Wallis test p < 0.001), with a significant difference between low vs. intermediate risk (adjusted p < 0.001) and low vs. high risk (adjusted p < 0.001), but not between intermediate vs. high risk (adjusted p = 0.126; Fig. 2D). In addition, Ki67 score showed a significant correlation with ROR score (Spearman's rho 0.596).

ROR score and intrinsic subtype across Stratipath risk groups
ROR scores were higher in the Stratipath high-risk group compared to the low-risk group (p < 0.001), across all cases as well as in the Prosigna intermediate-risk group and among grade 2 cases (Fig. 3). Regarding the distribution of intrinsic subtypes, a total of 83 out of 127 (65.4%) luminal A cases were classified as Stratipath low risk and 74 of 107 (69.2%) luminal B cases as Stratipath high risk (Fig. 1D, Supplementary Table S9), and similar results were observed among grade 2 cases (Supplementary Table S10). A significant difference in distribution of Prosigna intrinsic subtypes across Stratipath risk groups and Prosigna risk groups was identified for all cases as well as for grade 2 cases (chi-square test and Fisher's exact test p < 0.001; Table 4, Supplementary Table S5).

Discussion
In this study we show the first clinical comparison between the AI-based tool Stratipath Breast and the well-established multigene assay Prosigna for risk profiling of clinically intermediate-risk breast cancer. The agreement between the two tests reached 71.0% (Cohen's kappa of 0.42) for classifying patients as low risk and high risk. This considerable overall agreement between Stratipath Breast and Prosigna, two methodologically different tests, is on a similar level to what has been observed previously with respect to agreement between different multigene assays [25]. In a comparison of multigene tests in the OPTIMA Prelim trial, the overall agreement between Prosigna risk group and Oncotype DX recurrence score (Exact Sciences, Madison, USA) was 81.0% for low and high risk groups and 77.5% for low/intermediate and high risk groups [25]. A disagreement in this range is expected since these assays are based on different models, gene sets and clinical variables. Although the assays demonstrate robust prognostic value on a population level, discrepancies on the individual patient level are evident when different multigene assays are performed on the same tumor sample [1, 12].
Risk profiling of breast cancer, by e.g., the Prosigna assay, is currently an integrated part of clinical routine diagnostics for clinically intermediate-risk early breast cancer patients. This is crucial to avoid inadequate treatment, especially for ambiguous cases where traditional biomarkers are insufficient to predict if patients would benefit from e.g., additional adjuvant chemotherapy or could be spared chemotherapy. We observed a higher proportion of all cases classified as low risk by Stratipath Breast (49.6%) than by Prosigna (32.5%), which may impact treatment decisions on sparing patients chemotherapy. However, as Prosigna also provides an intermediate-risk group, which in this study constituted almost half of the cases (47%), the risk information remains inconclusive for these patients in guiding treatment decisions. Our findings also showed that Stratipath Breast classified a high proportion (47.3%) of the Prosigna intermediate-risk group as low risk. In addition, the majority (89.7%) of the Stratipath low-risk cases were found in the Prosigna low/intermediate-risk group, which is the patient group where adjuvant endocrine therapy alone is generally considered, depending on local routines. The Prosigna high-risk group is generally considered candidates for adjuvant chemotherapy. Furthermore, when assessing Ki67 as a clinical factor together with Stratipath Breast, 95.5% of the Stratipath low-risk cases with Ki67-low/intermediate status were also Prosigna low/intermediate risk. This supports the use of additional AI-based risk profiling to identify those patients that can be spared chemotherapy, assuming that other risk factors are in alignment. Furthermore, differences in intrinsic subtype were observed not only between the Prosigna risk groups but also between Stratipath risk groups, with a higher proportion of luminal A tumors in the low-risk group and luminal B tumors in the high-risk group.
Special resources at the individual pathology laboratory are required for running the Prosigna assay on the nCounter platform, including tissue preparation by macrodissection of the invasive tumor region and sectioning prior to analyses. In comparison, Stratipath Breast is a fully automated decision support tool which operates on digitized routine HE-stained slides; thus, limited additional workload apart from the digitization of routine slides is required, ensuring a considerably reduced turnaround time and cost.
The heterogeneous nature of breast cancer, and especially the inter-tumoral heterogeneity of histologic grade 2 tumors, has been illustrated by gene expression analysis (DNA microarray), which shows that these tumors constitute a mixture of gene expression patterns found in grade 1 and 3 tumors [22]. The gene expression signature, Genomic Grade Index, was further developed and has shown prognostic potential to more accurately divide histologic grade 2 into a low- and a high-risk group associated with risk of recurrence [22, 23, 45]. Leveraging the capacity of automated feature extraction […] [34, 46], DNA mutations [47], intra-tumoral heterogeneity [31] or intrinsic subtypes [36] directly from HE-stained slides. Further, by leveraging grade-related feature extraction, deep learning has shown the ability to stratify grade 2 tumors into a low- and a high-risk group [27].
The AI-based image-analysis tool used in this study extracts information based on the morphological appearance in the HE-stained tumor WSI to determine the patient's risk category. Histopathological variables including histologic subtype and tumor grade showed significantly different distributions between low-risk and high-risk groups by Stratipath Breast. The deep learning model has the capacity to capture a range of representations/features that are grade-related, but other than the actual subcomponents routinely identified by the pathologist when determining Nottingham histologic grade (i.e., tubular formation, pleomorphism and mitotic count) [27]. Here, we showed that histologic grade was associated with both Stratipath Breast risk groups and Prosigna risk groups. The association with tumor grade has been shown for several multigene assays [48-51] and tumor grade has also been incorporated in a prognostic index (Nottingham Px) for prognostic stratification of the clinically intermediate-risk group of breast cancer (node negative ER-positive/HER2-negative) [52].
To establish the risk category, Stratipath Breast utilizes only the WSI as input, i.e., without incorporating other clinical variables. In contrast, several of the clinical variables shown to be significantly different between Prosigna risk groups in this study are included in the ROR score. The PAM50 gene expression of the tumor sample designates the intrinsic subtype and is combined with a proliferation score and tumor size to calculate the ROR score [53]. Apart from the apparent methodological difference between the tests, it may be speculated that these differences in modalities could explain the discordances to some extent.
Differences in the prognostic performance between different assays can be explained by several factors, including different molecular markers included in the gene signature assays, where the Prosigna ROR score is largely determined by proliferative features whereas others are driven by ER-related features [51]. We found a significant correlation between the proliferation marker Ki67 and ROR score (Spearman's rho 0.596) in this clinical case series, which is in line with previous findings [24]. A significant association between Prosigna risk categories and Ki67 status was observed in all patients, and in low- vs. intermediate-risk groups and in low- vs. high-risk groups in the grade 2 subgroup. For Stratipath Breast, Ki67 was significantly different between low- and high-risk groups when evaluating all patients but not in the subset of grade 2 tumors. This is not unexpected since the PAM50 gene assay incorporates eleven proliferation-related genes, and while Stratipath Breast does not explicitly include any information on proliferation, the AI-based approach has the capacity to capture proliferation-associated morphological patterns in the WSIs.
Strengths of this study are that the CE-IVD marked commercial forms of both tests were used, in a clinical case series from two sites in the intended patient population. However, the study has several limitations. One limitation is the lack of follow-up information for prognostic comparisons, but this was outside the scope of this study and may instead be evaluated in future studies. Nor was treatment information available for evaluating the effect on treatment decisions. Another limitation is that the Prosigna assay categorizes a relatively large proportion of the cases as intermediate risk, which is non-informative for decision making and was thus excluded in several comparisons focusing on the two-level agreement (low and high risk), and we note that this is an intrinsic limitation of the Prosigna assay.
To conclude, in this study of clinically assessed intermediate-risk ER-positive/HER2-negative breast cancers, we observed a considerable agreement between Prosigna and Stratipath Breast for low-risk and high-risk groups. This is the first study where a commercial multigene assay is compared to the image-based risk profiling tool Stratipath Breast. There was, however, a discrepancy of almost 30% between these two risk groups in the two risk profiling tests.
Further studies with outcome data and impact on treatment decisions are of value for clinical comparisons.

Fig. 1 Sankey diagram of the re-classification of risk group between methods. Prosigna risk group vs. Stratipath risk group (a). Prosigna risk group (low and high) vs. Stratipath risk group (b). Prosigna risk

Fig. 2 Comparison of Ki67 score across risk groups and correlation to Prosigna risk of recurrence (ROR) score. Significant difference in distribution of Ki67 score across Stratipath Breast risk groups (p = 0.001, Mann-Whitney U test). Box plot illustrating median, interquartile range and range (a). No difference in Ki67 score between Stratipath Breast risk groups among grade 2 cases (p = 0.058, Mann-Whitney U test) (b). Significant difference in distribution of Ki67 score across Prosigna risk groups (adjusted p < 0.001, between all three groups,

Fig. 3 Difference in risk of recurrence (ROR) score across Stratipath risk groups. Higher ROR score in Stratipath high risk group than low risk group (N = 234; Mann-Whitney U test p < 0.001) (a). Higher ROR score in Stratipath high risk group than low risk group among ROR

Table 1 Patient characteristics for all included cases and grade 2 cases

Table 2 Comparison of agreement in risk stratification between Stratipath Breast risk group and Prosigna risk group

Table 3 Comparison of agreement in risk stratification between