Background

Histological grading is a well-established prognostic factor for breast cancer and is associated with the aggressiveness of the tumour [1]. The histological grade of a breast tumour is determined by assessing three morphological features: tubular formation (glandular differentiation), nuclear pleomorphism, and mitotic count, each of which is given a score from 1 to 3. The sum of the sub-component scores assigns tumours to one of three grades (Grade 1–3), referred to as the Nottingham Histological Grade (NHG), where Grade 1 is associated with a good prognosis and Grade 3 with a poor prognosis [2]. The grade provides prognostic information for clinically relevant subgroups (e.g. estrogen receptor (ER)-positive and human epidermal growth factor receptor 2 (HER2)-negative patients) and informs decisions on adjuvant chemotherapy [3].
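The scoring arithmetic itself is simple; a minimal sketch, using the standard Elston–Ellis cut-offs (which the text above does not restate), is shown below.

```python
def nhg_grade(tubular: int, pleomorphism: int, mitotic: int) -> int:
    """Map the three NHG sub-component scores (each 1-3) to the overall grade,
    using the standard Elston-Ellis cut-offs: a sum of 3-5 gives Grade 1,
    6-7 gives Grade 2, and 8-9 gives Grade 3."""
    total = tubular + pleomorphism + mitotic
    if total <= 5:
        return 1
    return 2 if total <= 7 else 3


assert nhg_grade(1, 2, 1) == 1
assert nhg_grade(2, 2, 3) == 2
assert nhg_grade(3, 3, 3) == 3
```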

However, histological grading suffers from high inter-observer variability, including in the assessment of its individual subcomponents [4,5,6]. A recent nationwide study in Sweden reported significant inter-laboratory variability in histological grading across pathology laboratories [7]. Such variability indicates an intrinsic uncertainty in routine NHG assessment and a potential for errors, which can cause both under- and over-treatment of breast cancer.

Recent advances in high-resolution digital whole-slide images (WSIs) have greatly enhanced the computer-based pathology workflow, paving the way to novel digital decision support solutions. Recently, deep learning-based analyses on WSIs have shown promising results in a multitude of tasks, including cancer classification, grading, and predictions of genetic mutations in prostate and lung cancers [8,9,10].

Deep learning, especially deep convolutional neural networks (CNNs), has proven effective for modelling WSI data, including in the application of breast cancer histological grading. Previously, models classifying grades 1 and 2 (combined) versus grade 3 have been developed for breast cancer [11, 12]. Jaroensri et al. implemented a model that classifies the sub-component scores of breast cancer histological grading and compared its prognostic performance against routine classification [13]. Wang et al. developed a model based on histological grade morphology in breast cancer and applied it to improve risk stratification of intermediate-risk (histological grade 2) patients [14].

To our knowledge, this is the first study to develop a deep learning-based breast cancer histological grade classifier with a three-level grading system resembling the routine NHG, together with a prognostic evaluation. We evaluate the proposed model (predGrade) in terms of prognostic performance (time-to-event) in both internal test data and a fully independent external test cohort, and compare it with the routine clinical grade assignment.

Methods

Study materials

The patients in this study were from two Swedish cohorts: SöS-BC-4 (n = 2421) and SCAN-B (n = 1262). SöS-BC-4 is a retrospective observational study that included patients diagnosed at Södersjukhuset (South General Hospital) in Stockholm between 2012 and 2018 with archived histological slides and histological grade information available. Patients who had received neoadjuvant therapy were excluded (n = 56). The SCAN-B cohort, which we used as an independent external test set, includes a subset of patients (n = 1262) enrolled in the prospective SCAN-B study [15], diagnosed between 2010 and 2019 in Lund, Sweden. Both cohorts consist of patients diagnosed with invasive breast cancer. Patients’ clinical information (i.e. clinical NHG, ER status, HER2 status, tumour size, and lymph node status) was retrieved from the Swedish National Registry for Breast Cancer (NKBC) (Additional file 1: Table S1). For the CONSORT diagram, see Fig. 1.

Fig. 1. SöS-BC-4 and SCAN-B cohort descriptions and splitting criteria. a. The SöS-BC-4 cohort was first split into the training set, internal test set 1, and internal test set 2 at the patient level. The split was stratified by clinical histological grade (NHG), ER status, HER2 status, and Ki-67 status. b. A five-fold cross validation (CV) split was further generated at the patient level within the training set (n = 1695 WSIs). Each CV fold consisted of a CV training set (80%) and a CV test set (20%) balanced on clinical NHG. The CV training set was further sub-split into the feature extractor training set, the attention module training set, and the tuning set (40%, 32%, and 8% of the full training set, corresponding to 50%, 40%, and 10% of the CV training set). c. The SCAN-B cohort was used as the independent external test set

WSIs were generated at 40× magnification using Hamamatsu NanoZoomer histopathology slide scanners (S360 or XR) from clinical routine Haematoxylin & Eosin (H&E)-stained, formalin-fixed paraffin-embedded (FFPE) resected tumour slides. We included one H&E WSI per patient: either the established primary diagnostic fraction or, otherwise, the H&E WSI with the largest predicted tumour area.

Image pre-processing and deep learning modelling methods

WSIs were pre-processed and quality controlled in a standardised processing pipeline, followed by model optimisation and performance validation of the system (Fig. 2).

Fig. 2. Overview of the image pre-processing, model optimisation, and performance evaluation. a. Standardised WSI preprocessing pipeline, from retrieval of the WSI at 40× magnification to the detected invasive cancer tiles. b. Schematic overview of the image modelling strategy, including the deep CNN feature extractor and attention module. Model optimisation, hyperparameter tuning, and model selection were performed by cross validation (CV). In each CV training round, the feature extractor and attention module were trained on cancer tiles in the CV training set. In each CV validation round, the feature extractor and attention module were re-optimised and the CV validation set was subsequently evaluated. c. Two cut-offs were derived from the slide-level prediction scores, categorising the prediction scores into three-level predicted grades. The cut-offs were optimised by maximising the agreement between the predicted grade and clinical NHG. We further evaluated the prognostic performance of the predicted grade on recurrence-free survival

WSI preprocessing

The WSI pre-processing pipeline has been previously described in detail in [14]. A brief overview of the preprocessing steps is shown in Fig. 2a. First, we generated tissue masks excluding most of the background from the WSIs. We capped the Otsu threshold at a maximum value of 25, since an overly high threshold on the transformed saturation channel would otherwise remove tissue regions in some cases. The tissue regions were divided into image patches (i.e. tiles) of size 1196 × 1196 pixels. The image tiles were downsampled by a factor of two from the original scanning resolution (40×) to 20× (598 × 598 pixels; 271 × 271 μm). Next, we applied the Laplacian filter (OpenCV package version 3.4.2) to all image tiles and computed the variance of each filtered tile. Tiles with a variance lower than 500 units were considered blurry and excluded from further analyses [16]. To mitigate stain colour variability, the colour normalisation method described by Macenko et al. [17] was applied, with a modification to enable WSI-level colour correction, as previously described in [14]. Lastly, we applied a pre-trained CNN model developed in [14] to detect invasive cancer in our current study population. Only tiles predicted as invasive cancer by the pre-trained model were considered regions of interest and included in further analyses. The median number of invasive cancer tiles per WSI was 911 for the SöS-BC-4 training set and 3069 for the SCAN-B cohort (Additional file 1: Fig. S1).
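The blur-filtering step is straightforward to reproduce; the following is a minimal sketch assuming tile is an RGB tile already downsampled to 20× (function and variable names are illustrative, not from the original pipeline).

```python
import cv2
import numpy as np

def is_blurry(tile: np.ndarray, threshold: float = 500.0) -> bool:
    """Flag a tile as blurry if the variance of its Laplacian falls below
    the threshold (500 units in our pipeline)."""
    gray = cv2.cvtColor(tile, cv2.COLOR_RGB2GRAY)
    variance = cv2.Laplacian(gray, cv2.CV_64F).var()
    return variance < threshold
```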

Image analysis using deep learning

The SöS-BC-4 cohort was used for model development and internal validation. The cohort was split into the training set (n = 1695), internal test set 1 (n = 245 WSIs), and internal test set 2 (n = 481 WSIs), as shown in Fig. 1a. The training and internal test sets were split at the patient level and stratified by clinical NHG, ER status, HER2 status, and Ki-67 status.

The training and optimisation of the feature extractor and attention module were performed on the training set (n = 1695) using five-fold cross validation (CV). For each CV fold, the training set was split into a CV training set (80%) and a CV test set (20%) stratified by clinical NHG, as shown in Fig. 1b. The CV training set was further sub-split into the feature extractor training set (50% of the CV training set), the attention module training set (40%), and the tuning set (10%); these three sub-splits were also stratified on clinical NHG (Fig. 1b). Both the feature extractor (a ResNet-18 CNN) and the attention module were optimised against binary class labels (NHG 1 and 3) to ensure that the model learns high- and low-grade patterns despite substantial label noise (reflected by high inter-assessor variability in NHG grade label assignments). The proposed approach implicitly assumes that NHG grading follows a continuum of morphological changes (1–3), and that NHG 2 is the intermediate group with the highest assessment uncertainty and inter-rater variability. We therefore excluded NHG 2 from model optimisation.
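A minimal sketch of the patient-level, NHG-stratified splitting is given below, assuming a DataFrame with one row per patient; the file and column names are illustrative.

```python
import pandas as pd
from sklearn.model_selection import StratifiedKFold, train_test_split

patients = pd.read_csv("training_set.csv")  # one row per patient, n = 1695
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

for fold, (train_idx, test_idx) in enumerate(skf.split(patients, patients["nhg"])):
    cv_train = patients.iloc[train_idx]  # 80%, stratified by clinical NHG
    cv_test = patients.iloc[test_idx]    # 20%
    # Sub-split the CV training set 50/40/10, again stratified on clinical NHG.
    fe_train, rest = train_test_split(
        cv_train, train_size=0.5, stratify=cv_train["nhg"], random_state=fold)
    attn_train, tuning = train_test_split(
        rest, train_size=0.8, stratify=rest["nhg"], random_state=fold)
```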

The attention-based Multiple Instance Learning (MIL) model was used as the modelling architecture, inspired by Lu et al. [18]. It consists of two separately trainable modules: the feature extractor and the attention module. The feature extractor was trained to learn breast cancer domain-specific tile-level representations, and the attention module was trained to aggregate these tile-level representations into whole-slide-level prediction scores. Importantly, we used different sub-splits of the training set for feature extractor optimisation and attention module optimisation, respectively.
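Concretely, the attention module computes the slide-level representation as a weighted average of tile-level feature vectors. A standard attention-MIL formulation (in the style of Ilse et al., on which the architecture of [18] builds; the exact parameterisation used here is not restated in the text) is

\[ \mathbf{z} = \sum_{k=1}^{K} a_k \mathbf{h}_k, \qquad a_k = \frac{\exp\left(\mathbf{w}^{\top}\tanh(\mathbf{V}\mathbf{h}_k)\right)}{\sum_{j=1}^{K}\exp\left(\mathbf{w}^{\top}\tanh(\mathbf{V}\mathbf{h}_j)\right)}, \]

where \(\mathbf{h}_k\) is the feature vector of tile \(k\), \(\mathbf{V}\) and \(\mathbf{w}\) are trainable parameters, the attention weights \(a_k\) sum to one, and \(\mathbf{z}\) is the slide-level representation passed to the classification layer.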

Feature extractor module

The feature extractor was a binary weakly supervised learning model [9, 14]. We applied the ResNet-18 [19] CNN architecture initialised with weights pre-trained on ImageNet [20]. To reduce overfitting, we included a dropout layer with a probability of 0.2 after the global average pooling layer. Furthermore, a fully connected layer of 1024 hidden units followed by rectified linear unit (ReLU) activation was added before the final output layer to increase the depth of the architecture. This model was trained on binary labels of NHG 1 versus NHG 3 with the cross-entropy loss. We used the stochastic gradient descent (SGD) optimiser [21] with a learning rate of 1e-5 and a momentum of 0.9. At the end of each partial training epoch, we used the tuning set to validate the training performance and saved the best model according to the lowest tuning-set validation loss. Early stopping terminated training when the validation loss showed no improvement over 50 consecutive partial epochs.
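A minimal PyTorch sketch of this architecture is shown below; the stated components (ResNet-18 backbone, dropout of 0.2 after global average pooling, a 1024-unit fully connected layer with ReLU, and SGD with learning rate 1e-5 and momentum 0.9) follow the text, while everything else is illustrative.

```python
import torch
import torch.nn as nn
from torchvision import models

class FeatureExtractor(nn.Module):
    def __init__(self, n_classes: int = 2):
        super().__init__()
        backbone = models.resnet18(weights=models.ResNet18_Weights.IMAGENET1K_V1)
        # Keep everything up to and including global average pooling (512-d output).
        self.encoder = nn.Sequential(*list(backbone.children())[:-1])
        self.head = nn.Sequential(
            nn.Flatten(),
            nn.Dropout(p=0.2),           # dropout after global average pooling
            nn.Linear(512, 1024),        # added fully connected layer
            nn.ReLU(),
            nn.Linear(1024, n_classes),  # binary output: NHG 1 vs NHG 3
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.head(self.encoder(x))

    def extract(self, x: torch.Tensor) -> torch.Tensor:
        # 512-d tile embeddings later consumed by the attention module.
        return self.encoder(x).flatten(1)

model = FeatureExtractor()
optimiser = torch.optim.SGD(model.parameters(), lr=1e-5, momentum=0.9)
criterion = nn.CrossEntropyLoss()
```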

Attention module

We used the feature extractor to extract a 512-dimensional feature vector from the average pooling layer for each image tile in the attention module training set and the tuning set (Fig. 2b). These learned features were then used to train the attention module, which consisted of an attention backbone and a classification layer with two output neurons, one for each class [18]. The attention backbone assigns and optimises a weight for each tile-level feature vector in each WSI; the derived weights sum to one so that the aggregation is invariant to the number of tiles per slide. Tile-to-slide aggregation was performed by taking the weighted average of the feature vectors of all image tiles in each slide [22]. The attention module was trained as a binary classifier of NHG 1 versus NHG 3 tumours using the cross-entropy loss. We used the SGD optimiser with a learning rate of 1e-5 and a momentum of 0.9. Each training batch consisted of a single slide, including all of its image tiles, based on our previous work [23].
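A minimal sketch of such an attention module is shown below; the 512-dimensional input and two-neuron classification layer follow the text, while the hidden width of the attention backbone is an illustrative assumption.

```python
import torch
import torch.nn as nn

class AttentionMIL(nn.Module):
    def __init__(self, in_dim: int = 512, hidden_dim: int = 256, n_classes: int = 2):
        super().__init__()
        self.attention = nn.Sequential(      # attention backbone
            nn.Linear(in_dim, hidden_dim),
            nn.Tanh(),
            nn.Linear(hidden_dim, 1),
        )
        self.classifier = nn.Linear(in_dim, n_classes)

    def forward(self, tile_features: torch.Tensor):
        # tile_features: (n_tiles, 512), all tiles of one slide as a single batch
        scores = self.attention(tile_features)                 # (n_tiles, 1)
        weights = torch.softmax(scores, dim=0)                 # sums to one over tiles
        slide_feature = (weights * tile_features).sum(dim=0)   # weighted average
        return self.classifier(slide_feature), weights
```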

Assignment of predicted histological grade (i.e. predGrade)

We obtained the slide-level prediction scores, i.e. \(P(\text{class} = \text{NHG3} \mid \text{WSI}_i)\), for the entire training set (n = 1695 WSIs) from the five-fold CV. We then optimised two thresholds \(\theta_1\) and \(\theta_2\) on \(P(\text{class} = \text{NHG3} \mid \text{WSI}_i)\) to generate a three-level predicted grade (predGrade 1, 2, and 3). The thresholds were established through an exhaustive search on the training set under five-fold CV, maximising the agreement between the clinical NHG and the predicted grades as measured by Cohen’s kappa (κ).
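A minimal sketch of the exhaustive two-threshold search is shown below; the grid resolution is an illustrative assumption, as the text does not state the step size.

```python
import numpy as np
from sklearn.metrics import cohen_kappa_score

def fit_cutoffs(scores, clinical_nhg, grid=np.linspace(0.0, 1.0, 101)):
    """Exhaustively search theta_1 < theta_2 maximising Cohen's kappa between
    clinical NHG and the three-level predicted grade."""
    best = (None, None, -np.inf)
    for t1 in grid:
        for t2 in grid[grid > t1]:
            pred_grade = np.digitize(scores, [t1, t2]) + 1  # predGrade 1, 2, or 3
            kappa = cohen_kappa_score(clinical_nhg, pred_grade)
            if kappa > best[2]:
                best = (t1, t2, kappa)
    return best  # (theta_1, theta_2, best kappa)
```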

Assessment of model performance

The performance of predGrade was evaluated both in the five-fold CV and in the independent external test set (SCAN-B cohort, n = 1262). For the evaluation in the SCAN-B cohort, the five CV models were treated as base models in an ensemble, and the five predicted scores \(P(\text{class} = \text{NHG3} \mid \text{WSI}_i)\) were aggregated by taking the median across all base-model predictions. Next, we applied the thresholds \(\theta_1\) and \(\theta_2\) (see above and Fig. 2c) to map the predictions to predGrade 1, 2, and 3.
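The ensemble step reduces to a median over the base-model scores followed by the two cut-offs; a short sketch (array shapes and threshold values are illustrative placeholders):

```python
import numpy as np

cv_scores = np.random.rand(5, 1262)   # (n_models, n_slides): P(class = NHG3 | WSI)
theta_1, theta_2 = 0.35, 0.65         # placeholders for the cut-offs learned in CV

ensemble_score = np.median(cv_scores, axis=0)                     # median per slide
pred_grade = np.digitize(ensemble_score, [theta_1, theta_2]) + 1  # predGrade 1-3
```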

First, we assessed the agreement between predGrade and the clinical NHG in the independent external test set using confusion matrices. Since the clinical NHG has high inter-rater variability, we used patient outcome, recurrence-free survival (RFS), as our primary evaluation metric, comparing the prognostic performance of predGrade and the clinical NHG. RFS defined recurrence (i.e. local or distant metastasis, or detection of contralateral tumours) or death as the event. Patients were followed from the initial diagnosis to the date of death/recurrence, emigration, or the last registration date, whichever occurred first. Kaplan–Meier (KM) curves for predGrade and the clinical NHG on RFS, with time since initial diagnosis as the underlying timescale, were used for visualisation. Differences in survival probability among clinical NHG and predGrade subgroups were tested using the log-rank test. We assessed the associations of predGrade and of clinical NHG with RFS by estimating hazard ratios (HRs) with 95% confidence intervals (CIs) using Cox proportional hazards (PH) models, again with time since initial diagnosis as the underlying timescale. First, we fitted univariate Cox models for predGrade and clinical NHG, respectively. Next, we fitted multivariable Cox models additionally adjusting for the well-established clinicopathological factors: tumour size, ER status, HER2 status, lymph node status, and age at diagnosis. Tumour size was dichotomised as ≥ 20 mm or < 20 mm. ER status was positive if immunohistochemical (IHC) staining indicated more than 10% ER-positive cells. HER2 status was determined using IHC staining and FISH or SISH assays. Lymph node status denoted the presence of lymph node metastasis. Cases with missing outcome or covariate data were excluded from the analyses. A two-sided alpha of 0.05 was used for all statistical tests. Further, the c-index was computed for the categorical predGrade, the clinical NHG, and the continuous slide-level predGrade score. Bootstrap resampling (n = 1000) was used to calculate the confidence intervals.
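The survival analyses were performed in R (see Software below); an analogous sketch in Python using lifelines is shown here for concreteness, with illustrative file and column names.

```python
import pandas as pd
from lifelines import CoxPHFitter
from lifelines.statistics import multivariate_logrank_test
from lifelines.utils import concordance_index

df = pd.read_csv("cohort.csv")  # one row per patient: time, event, grades, covariates

# Log-rank test across the three predGrade strata.
logrank = multivariate_logrank_test(df["time"], df["predGrade"], df["event"])
print(logrank.p_value)

# Univariate Cox PH model, predGrade entered as dummies with grade 1 as reference.
design = pd.get_dummies(df[["time", "event", "predGrade"]],
                        columns=["predGrade"], drop_first=True)
cph = CoxPHFitter()
cph.fit(design, duration_col="time", event_col="event")
cph.print_summary()  # HRs with 95% CIs for predGrade 2 and 3 vs 1

# C-index for the continuous slide-level prediction score
# (negated, since higher scores imply shorter recurrence-free survival).
cindex = concordance_index(df["time"], -df["pred_score"], df["event"])
```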

Software packages used for computer vision and statistical analyses

WSIs were read using the package openslide (v.3.4.1) [24]. Tissue masking, tiling, and colour normalisation steps were implemented using the packages scikit-image (v.0.16.2) [25], OpenCV (v.3.4.2) [26], SciPy (v.1.5.0) [27], pillow (v.7.2.0) [28], pandas (v.1.0.5) [29], and NumPy (v.1.18.5) [30]. Further, TensorFlow (v.1.12.0) [31] was used to implement the pretrained invasive cancer detection model. The training, optimisation, and validation of the feature extractor and attention module were performed using PyTorch (v.1.7.1) [32]. The hyperparameter optimisation of the learning rate was performed using the tune library [33] from the package ray (v.0.8.6) [34]. All the WSI preprocessing and deep learning analyses were performed in Python v.3.6.10. Statistical testing for differences in hazard ratio estimates between the grade models was performed using the hr.comp2 function in the survcomp package (v.1.48.0) in R [35]. The c-index was computed using the concordance function in the survival package (v.3.5.0) in R [36], and bootstrap estimates of the CIs were calculated using the bootstrap function in the sjstats package (v.0.18.2) in R [37].

Results

Model performance in the five-fold CV of the SöS-BC-4 training set

Classification performance

Classification performance (Fig. 3) of predGrade in comparison with clinical NHG was first assessed using CV (see the Methods section), indicating fair agreement between predGrade and clinical NHG (Cohen’s κ = 0.33) [38]. Of the clinical NHG 1 tumours, 4.8% were classified as predGrade 3, and 5.3% of clinical NHG 3 tumours were classified as predGrade 1. Examples of tiles from correctly classified predGrade 1, 2, and 3 slides are provided in Additional file 1: Table S2.

Fig. 3. Confusion matrix showing the agreement between predGrade and clinical NHG in the five-fold CV

Prognostic performance

Subsequently, we evaluated the prognostic performance through time-to-event analysis. Figure 4 shows the KM curves comparing the risk stratification on RFS by clinical NHG and predGrade. We observed similar stratification effects for predGrade and clinical NHG: patients with clinical NHG 1 or predGrade 1 showed the best survival, while patients assigned clinical NHG 3 or predGrade 3 showed the worst survival (Fig. 4).

Fig. 4. Kaplan–Meier (KM) curves on recurrence-free survival from the five-fold CV in the SöS-BC-4 training set. a. KM curve stratified by clinical NHG and b. KM curve stratified by predGrade

In the univariate Cox models, we observed similar effect sizes for the associations of predGrade and clinical NHG with RFS (Fig. 5a and b). PredGrade 3 (HR = 3.12, 95% CI 1.69–5.76, p-value < 0.001) and clinical NHG 3 (HR = 3.19, 95% CI 1.74–5.85, p-value < 0.001) showed an approximately threefold risk of an event compared to predGrade 1 and clinical NHG 1, respectively. Neither the HR estimate for predGrade 2 nor that for clinical NHG 2 reached statistical significance at the 0.05 level (Fig. 5a and b).

Fig. 5. Evaluation of the prognostic performance on recurrence-free survival (RFS) in the five-fold CV in the SöS-BC-4 training set. a. Univariate Cox PH regression analysis between the clinical NHG and RFS; b. univariate Cox PH model between the predGrade and RFS; c. multivariable Cox PH model between the clinical NHG and RFS adjusting for age, tumour size, lymph node, ER, and HER2 status; d. multivariable Cox PH model between the predGrade and RFS adjusting for age, tumour size, lymph node, ER, and HER2 status

In the multivariable Cox PH models, adjusting for tumour size, lymph node, ER, and HER2 status, predGrade 3 remained associated with a higher risk of death/recurrence (HR = 2.27, 95% CI 1.07–4.82, p-value = 0.033) (Fig. 5d). A similar association was noted for clinical NHG 3 (HR = 2.60, 95% CI 1.18–5.70, p-value = 0.017) (Fig. 5c). Neither predGrade 2 nor clinical NHG 2 was significantly associated with RFS at the 0.05 level (Fig. 5c and d). In addition, we observed a higher risk of death/recurrence linked to older age and tumour size of 20 mm or larger, while lymph node status, ER status, and HER2 status were not related to RFS (Fig. 5c and d).

Model performance in the independent external test set (SCAN-B cohort)

Classification performance

The classification performance in the independent external test set, as assessed by the confusion matrix (Fig. 6) and Cohen’s κ = 0.33, was consistent with the CV results.

Fig. 6. Confusion matrix showing the agreement between predGrade and clinical NHG in the independent external test set

Prognostic performance

In the independent external test set, KM curves showed similar risk stratification on RFS by the predGrade (log-rank p-value = 0.00049) as compared to the clinical NHG (log-rank p-value = 0.0002) (Fig. 7).

Fig. 7. Kaplan–Meier (KM) curves on recurrence-free survival stratified by clinical NHG and predGrade in the independent external test set. a. KM curve stratified by the clinical NHG and b. KM curve stratified by the predGrade

In the univariate Cox PH models, we observed similar effect sizes for the associations of clinical NHG and predGrade with RFS (Fig. 8a and b). Patients with clinical NHG 3 (HR = 3.58, 95% CI 1.88–6.81, p-value < 0.001) or predGrade 3 (HR = 4.07, 95% CI 1.75–9.47, p-value < 0.001) had three- to fourfold increased risks of death/recurrence compared to those with clinical NHG 1 or predGrade 1 (Fig. 8a and b). Clinical NHG 2 (HR = 2.59, 95% CI 1.36–4.92, p-value = 0.004) and predGrade 2 (HR = 2.52, 95% CI 1.10–5.81, p-value = 0.030) were associated with an approximately 2.5-fold increased risk of death/recurrence (Fig. 8a and b).

Fig. 8. Evaluation of the prognostic performance on recurrence-free survival (RFS) in the independent external test set. a. Univariate Cox model between the clinical NHG and RFS; b. univariate Cox model between the predGrade and RFS; c. multivariable Cox model between the clinical NHG and RFS adjusting for age, tumour size, lymph node, ER, and HER2 status; and d. multivariable Cox model between the predGrade and RFS adjusting for age, tumour size, lymph node, ER, and HER2 status

In the multivariable Cox PH analysis, adjusting for tumour size, lymph node, ER, and HER2 status, the associations of predGrade and clinical NHG with RFS were no longer statistically significant (Fig. 8c and d), although the effect size estimates pointed in the same direction as in the univariate analyses. In addition, we noted that older age at diagnosis and larger tumour size were linked to a higher risk of death/recurrence, while ER-positive status was related to a lower risk (Fig. 8c and d). Lymph node status and HER2 status were not related to RFS (Fig. 8c and d).

Next, we tested for a difference in the hazard ratio estimates for clinical NHG 2 versus 1 and predGrade 2 versus 1, finding no statistically significant difference (p-value > 0.05). We likewise tested for a difference in the hazard ratio estimates for clinical NHG 3 versus 1 and predGrade 3 versus 1, again revealing no significant difference (p-value > 0.05). The hazard ratios in these analyses were calculated from the multivariable Cox PH models after adjusting for the established covariates.

Lastly, we evaluated the c-index of the categorical predGrade and clinical NHG in the univariate Cox PH model, observing c-indices of 0.62 (95% CI 0.57–0.67) and 0.64 (95% CI 0.59–0.69), respectively. Further, we observed a c-index of 0.62 (95% CI 0.56–0.68) for the continuous predGrade slide-level score.

Subgroup analyses restricted to ER(+ve) or ER(+ve)/HER2(−ve) groups in the independent test set

We plotted KM curves on RFS stratified by clinical NHG and predGrade among ER(+ve) patients (Fig. 9a and b) and ER(+ve)/HER2(−ve) patients (Fig. 9c and d).

Fig. 9. Subgroup analysis: Kaplan–Meier (KM) curves on recurrence-free survival (RFS) in the independent external test set within ER(+ve) or ER(+ve)/HER2(−ve) groups. a. KM stratified by clinical NHG in ER(+ve) patients; b. KM stratified by predGrade in ER(+ve) patients; c. KM stratified by clinical NHG in ER(+ve)/HER2(−ve) patients; d. KM stratified by predGrade in ER(+ve)/HER2(−ve) patients

In the univariate Cox model restricted to ER(+ve) patients, we observed increased risks of death/recurrence associated with clinical NHG 3 (HR = 2.63, 95% CI 1.32–5.24, p-value = 0.006) and predGrade 3 (HR = 3.27, 95% CI 1.37–7.82, p-value = 0.008) (Fig. 10a and b). Clinical NHG 2 showed a 2.42-fold increased risk of death/recurrence, while predGrade 2 was not significantly related to RFS, albeit with a similar point estimate (HR = 2.07, 95% CI 0.86–4.64, p-value = 0.091) (Fig. 10a and b). In the multivariable analysis, the associations of clinical NHG and predGrade with RFS were attenuated and no longer statistically significant (Fig. 10c and d).

Fig. 10. Evaluation of the prognostic performance (RFS) of predGrade in the independent external test cohort within ER(+ve) or ER(+ve)/HER2(−ve) groups. Univariate Cox PH models between a. clinical NHG, b. predGrade and RFS among ER(+ve) patients; multivariable Cox PH models between c. clinical NHG, d. predGrade and RFS among ER(+ve) patients adjusting for age, tumour size, and lymph node status. Univariate Cox PH models between e. clinical NHG, f. predGrade and RFS among ER(+ve)/HER2(−ve) patients. Multivariable Cox models between g. clinical NHG, h. predGrade and RFS among ER(+ve)/HER2(−ve) patients adjusting for age, tumour size, and lymph node status

Among the ER(+ve)/HER2(−ve) patients, we observed HR = 2.43 (95% CI 1.19–4.99, p-value = 0.015) for clinical NHG 3 and HR = 3.15 (95% CI 1.29–7.65, p-value = 0.011) for predGrade 3 (Fig. 10e and f). Clinical NHG 2 was associated with RFS (HR = 2.41, 95% CI 1.26–4.62, p-value = 0.008), while predGrade 2 was not (Fig. 10e and f). In the multivariable Cox PH models among ER(+ve)/HER2(−ve) patients, neither clinical NHG nor predGrade was related to RFS (Fig. 10g and h), while older age was linked to poorer RFS in the analysis for predGrade (Fig. 10h).

Again, we tested for differences in the HR estimates between clinical NHG and predGrade in the subgroup analyses, both for grade 2 versus 1 and grade 3 versus 1, and found that neither was significantly different (p-value > 0.05), indicating similar prognostic performance for clinical NHG and the predGrade model. Further, we evaluated the c-index in the univariate Cox PH analysis, observing 0.59 (95% CI 0.55–0.64) for the categorical clinical NHG and 0.61 (95% CI 0.56–0.65) for predGrade. Lastly, we observed a c-index of 0.62 (95% CI 0.56–0.67) for the continuous predGrade slide-level score.

Discussion

In this study, we developed a deep learning model to reproduce the clinical NHG of breast cancer patients. The proposed model was first evaluated using CV, followed by validation in a fully independent external test set. Histological grading of breast tumours is routinely assessed in the clinical setting and remains an important prognostic factor contributing to clinical decision making, especially for ER(+ve)/HER2(−ve) patients. However, it is well known that NHG suffers from substantial inter-assessor and inter-laboratory variability, which motivates the development of decision support solutions that can improve the quality and consistency of the assessment.

Our proposed model, predGrade, exhibited fair label agreement with the clinical NHG (κ = 0.33). This imperfect agreement is likely driven by ground truth label noise, during both training and validation, mostly from the intermediate NHG 2 group, given its high inter-rater variability [4]. Interestingly, however, we observed similar prognostic performance (RFS) for predGrade compared with the clinical NHG, also when restricted to the clinically relevant subgroups of ER(+ve) or ER(+ve)/HER2(−ve) patients. Our results suggest that the deep learning-based predGrade provides prognostic performance (RFS) similar to that of the clinical NHG, a key consideration since the conventional NHG grade is primarily used in clinical settings as a prognostic factor [39]. This indicates that deep learning-based solutions can provide decision support based on the same principles as histological grading while offering the benefits of objectivity and consistency. The model has the potential to reduce inter-assessor variability between pathologists and systematic variability between pathology laboratories, which has recently been shown to result in unequal diagnostic quality for patients [7].

Previous studies have focussed on classifying NHG 1 and 2 (low–intermediate) combined versus NHG 3 (high) [11, 12]. Wetstein et al. reported a 37% increased risk of recurrence associated with a high grade compared to a low–intermediate grade. Wang et al., on the other hand, demonstrated that a deep learning model optimised to discriminate NHG 3 versus NHG 1 can further stratify the intermediate NHG 2 group into NHG2-low and NHG2-high, enabling improved prognostic stratification of NHG 2 patients [14].

Jaroensri et al. mainly focussed on developing the composite NHG score from predictive models for the individual subcomponent scores and validated them against a panel of pathologists [13]. Despite a difference in survival endpoints, our c-index of 0.62 is similar to the 0.60 (95% CI 0.55–0.65) reported by Jaroensri et al. for the combined deep learning-based NHG grade. We believe that the NHG score, rather than the individual subcomponents, has higher clinical relevance, at least with respect to prognostic applications.

In our modelling strategy, we make some key assumptions. Given the inter-observer and inter-laboratory variability in clinical NHG, which is relatively higher for NHG 2, we decided to optimise our model for the classification of NHG 3 versus NHG 1, where there is less label noise. As stated earlier, this strategy assumes that tumour grade exists on a morphological continuum from low to high grade rather than as discrete labels, in addition to assuming more reliable ground truth for low- and high-grade NHG. Such variability in ground truth labels is one of the important challenges in developing deep learning-based clinical decision support tools, especially weakly supervised models where the label is only available at the WSI level. An alternative approach would be to reduce label noise, e.g. by utilising consensus labels assigned by a set of assessors, as performed in [13]. However, such attempts remain challenging due to the resources required and the shortage of pathologists in most parts of the world.

Conclusion

In this study, we developed and validated a deep learning-based model for breast cancer histological grading that provides a three-group grade assignment similar to the well-established Nottingham Histological Grade. We found that, despite the relatively low concordance of grade labels with clinical NHG, the proposed model provides equivalent prognostic stratification of breast cancer patients. The proposed model has the potential to provide objective and consistent decision support for histological grading, reducing the previously observed inter-assessor and systematic inter-laboratory variability in breast cancer histological grading, with the benefit of increased equality for patients and a reduced risk of over- and under-treatment.