1 Introduction

In most solid malignancies, therapeutic decision making is primarily based on pathological staging of tumors. The traditional tumor, (lymph) node, metastasis (TNM) staging system [1] is routinely used to estimate patient prognosis and guide treatment worldwide. For certain tumor types, however, the TNM system lacks accuracy in assessing the metastatic potential of a tumor. For instance, TNM stage II colorectal cancer (CRC) comprises a heterogeneous group with a diverse outcome [2]. As a result, the TNM stage is not informative for therapy planning of these patients, leading to both under- and over-treatment. Reliable new biomarkers are needed to guide personalized adjuvant treatment for these groups of patients.

A widely studied prognostic factor is the tumor-stroma ratio (TSR), expressing the relative amounts of tumor and intratumoral stroma. TSR is a straightforward measure which can be assessed by microscopic inspection of hematoxylin and eosin (H&E) stained tissue sections. TSR has been shown to yield prognostic information in a range of solid malignancies, including breast cancer [3,4,5] and lung cancer [6, 7]. Generally, TSR is an independent prognostic factor, where a high content of intratumoral stroma is associated with a poor prognosis. A number of previous studies showed promising results on the prognostic relevance of TSR in CRC [8,9,10,11,12]. Despite this evidence, there is no implementation in routine pathology reporting. This may be attributed to the variety in methodology and the lack of a standardized procedure for TSR assessment. Published studies propose visual assessment (‘eyeballing’), systematic point counting, and the use of scanned (digitized) tissue sections (whole slide images; WSI). Although good inter-observer agreement was found in earlier studies [9, 11, 13], visual assessment of pathological quantitative features in general may suffer from reproducibility issues.

To facilitate objective and standardized TSR assessment, image analysis and machine learning algorithms have previously been applied to H&E-stained sections of CRC; however, these algorithms were applied to image regions extracted from WSI. Computer-aided tumor and stroma quantification has been proposed based on automated tissue segmentation in H&E-stained sections using a combination of hand-crafted features and machine learning [14]. Furthermore, TSR has been computed via automated point counting in H&E-stained images [15]. Similar image analysis techniques based on classical machine learning have been applied to tissue microarrays for epidermal growth factor receptor (EGFR) detection by immunohistochemistry [16, 17]. A new branch of machine learning, so-called deep learning, has recently entered the field of computational pathology and has shown promise for automating certain tasks in histopathology. Detection of sentinel lymph node metastases [18] and of cancer in prostate biopsies [19] could successfully be performed using convolutional neural networks (CNN), a specific type of deep learning. We recently showed [20] that a deep learning-based algorithm can distinguish between 9 different types of tissue in CRC WSI with an overall accuracy of 93.8%.

The present study aims to leverage our previously developed CNN for automated TSR assessment in the CRC sub-class of rectal adenocarcinomas. Only a limited number of studies have been published on TSR for rectal cancers, and in a sub-analysis (n = 43) by West et al. [12] its prognostic value could not be confirmed. Work by Scheer et al. [8] recently showed that TSR has potential as a prognostic factor for survival in surgically treated rectal cancer patients; however, TSR was found to be an independent prognosticator only in lymph node metastasis negative cases. The performance of the automated TSR system described here will be compared with data from human experts and its prognostic value will be evaluated in terms of disease-specific and disease-free survival times.

2 Materials and methods

2.1 Patients

An existing cohort of 154 patients [8] with rectal adenocarcinoma stages I-III was used. All patients received curative surgery in the period 1996–2006 at the Medisch Spectrum Twente hospital (The Netherlands). No patient was neoadjuvantly treated with radiotherapy and/or chemotherapy or died within 30 days after surgery. At the time of surgery, none of the patients had known distant metastases, inflammatory bowel disease, hereditary nonpolyposis colorectal cancer (HNPCC) or other/earlier cancers. Histopathological data were obtained from the Laboratory for Pathology Eastern Netherlands (LabPON). Clinical data were obtained from the Medisch Spectrum Twente hospital and the Netherlands Comprehensive Cancer Organization (IKNL). Collected clinicopathological data included tumor grade (differentiation), depth of invasion (pT) and lymph node involvement (pN) according to the Union Internationale Contre le Cancer/American Joint Committee on Cancer (UICC/AJCC) TNM staging system [1]. Data regarding adjuvant therapy and local or distant recurrence were also available.

2.2 Tissue slide preparation and scanning

According to standard procedures at LabPON, formalin fixed and paraffin embedded tissue sections were cut at 2 μm and stained in an automatic stainer with hematoxylin and eosin (H&E) for routine diagnostic purposes. For the present study, a single slide per patient was selected which contained the most invasive part of the tumor and was used in diagnostics to assess the tumor pT-status. Slides were scanned at ×200 total magnification (tissue level pixel size ~0.455 μm/pixel) using a Hamamatsu NanoZoomer 2.0-HT (C9600–13) scanner (Herrsching, Germany).

2.3 Visual estimation of intratumoral stroma

Two observers (GvP, WM; both > 10 years of experience with TSR scoring) independently scored the slides using a conventional light microscope according to a previously published protocol for TSR assessment [9, 10]. Briefly, the procedure consisted of 1) coarse localization of the tissue area with the highest intratumoral stroma content at low microscope magnification, and 2) selection of one field of view at ×100 total magnification and visual estimation of the tumor-stroma ratio (TSR-visual) in the selected circular region. Ideally, the selected region should meet the following criteria: high intratumoral stroma content (predominantly found at the invasive margin of a tumor); presence of tumor cells at all borders of the field of view; no large quantities of muscle, mucus, necrosis or large vessels; and no tears or tissue retraction artefacts. Where possible, the region with the highest stroma content (stroma hot-spot) that met all of the above requirements was selected. TSR-visual was estimated by both observers independently, using 10% increments. With the specific microscope and lenses used, the specimen-level diameter of the circular region was 1.8 mm at ×100 magnification. Published studies vary considerably in their TSR procedures (e.g. major differences in the location and size of the assessed tissue regions, as well as in what was actually measured: relative tumor or stroma content). For clarity, in this study the tumor-stroma ratio was defined as TSR = 100% × [intratumoral stroma area] / [tumor area + intratumoral stroma area]. Lumen, tears and other tissue types in the selected circular region were excluded during visual estimation.
Lastly, the tissue region considered most suitable for TSR assessment was identified during a consensus meeting between the two observers, in which 1) a binary TSR consensus score was determined (‘stroma-low’ or ‘stroma-high’), and 2) the center of the stroma hot-spot was marked on the glass slide.
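The TSR definition used in this study can be written as a small function. This is a minimal sketch for illustration; the function name `tumor_stroma_ratio` and its arguments are assumptions and do not come from the study. The areas may be in any unit, as long as both use the same one.

```python
def tumor_stroma_ratio(stroma_area: float, tumor_area: float) -> float:
    """Return TSR in percent: stroma area relative to tumor plus stroma area.

    Other tissue types (lumen, tears, muscle, ...) are excluded simply by
    not passing their areas to this function.
    """
    total = tumor_area + stroma_area
    if total <= 0:
        raise ValueError("tumor and stroma areas must sum to a positive value")
    return 100.0 * stroma_area / total


# Example: a region with three times as much stroma as tumor.
print(tumor_stroma_ratio(stroma_area=3.0, tumor_area=1.0))
```

Under this definition a higher value means more intratumoral stroma, so a value above the cut-off corresponds to a stroma-high case.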

2.4 Automated computation of intratumoral stroma

To study the value of applying a deep learning algorithm for automated TSR assessment (TSR-auto), a CNN was developed similar to a previously published algorithm [20]. The CNN performs tissue segmentation (i.e. subdivision of tissue areas) of H&E-stained rectal cancer WSI into nine different classes: tumor, intratumoral stroma, necrosis, muscle, healthy epithelium, fatty tissue, lymphocytes, mucus and erythrocytes. The CNN was trained using manually annotated regions in 74 WSI taken from the cohort used in this study. Regions to annotate were selected to cover the tissue variety across WSI, rather than to produce exhaustive annotations on a small number of WSI. Annotations were produced by a pathology researcher (OG) and a medical student, and were checked and corrected when deemed necessary by an experienced pathologist (AB). A digital staining normalization method [21] was applied to all WSI as a pre-processing step to compensate for typical differences in tissue staining intensities caused by variations in slide preparation. Unlike Ciompi et al. [20], here we used patches of 256 × 256 pixels for classification, which was found experimentally to improve performance and produce a smoother segmentation map (data not shown). Performance of the system was assessed by segmenting all WSI in the dataset in a five-fold cross validation fashion (at WSI level) and evaluating accuracy in all annotated regions.
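Cross validation at WSI level means that all annotated regions from one slide fall into the same fold, so no slide contributes to both training and evaluation. The sketch below illustrates such a split. The study does not describe how the folds were drawn, so the shuffled round-robin assignment, the function names and the slide identifiers are all hypothetical.

```python
import random


def wsi_level_folds(slide_ids, n_folds=5, seed=0):
    """Assign whole slides to folds (assumed scheme: shuffle, then round-robin)."""
    ids = sorted(set(slide_ids))
    random.Random(seed).shuffle(ids)
    return [ids[i::n_folds] for i in range(n_folds)]


def train_test_splits(slide_ids, n_folds=5, seed=0):
    """Yield (train_slides, test_slides) pairs, one pair per fold."""
    folds = wsi_level_folds(slide_ids, n_folds, seed)
    for k, test in enumerate(folds):
        train = [s for j, fold in enumerate(folds) if j != k for s in fold]
        yield train, test


# Hypothetical identifiers for the 74 annotated slides.
slides = [f"slide_{i:03d}" for i in range(74)]
for train, test in train_test_splits(slides):
    print(len(train), len(test))
```

Splitting by slide rather than by patch avoids an optimistic bias, since neighbouring patches from one slide share staining and tissue characteristics.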

To enable comparison, the CNN-based TSR-auto was computed in the same circular region (with 1.8 mm diameter) that was selected by the observers at the consensus meeting, where TSR-visual was assessed. The corresponding image data were extracted from each WSI as circles with a diameter of ~4000 pixels (1.8 mm at ~0.455 μm/pixel) and processed further by the CNN described above (Fig. 1). Segmentation of a WSI into nine different tissue classes enabled inclusion and exclusion of specific tissue types in a way comparable to the visual assessment procedure. TSR-auto is defined analogously to TSR-visual, expressing the stroma area as a percentage of the area occupied by tumor and stroma together.
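The two steps described above, converting the 1.8 mm field of view to pixels and counting tumor and stroma pixels inside the circle, can be sketched as follows. This is an illustration under stated assumptions, not the study's implementation: the integer label codes, the function names and the nested-list label map are hypothetical, and a real WSI would call for an array library rather than Python loops.

```python
# Hypothetical label codes; the study's CNN distinguishes nine classes,
# but only tumor and stroma enter the ratio.
TUMOR, STROMA, OTHER = 0, 1, 2


def diameter_in_pixels(diameter_mm: float, pixel_size_um: float) -> float:
    """Convert a specimen-level diameter to pixels at the given resolution."""
    return diameter_mm * 1000.0 / pixel_size_um


def tsr_in_circle(label_map, center_row, center_col, radius_px):
    """TSR (percent) from the labels that fall inside a circular region."""
    tumor = stroma = 0
    radius_sq = radius_px * radius_px
    for r, row in enumerate(label_map):
        for c, label in enumerate(row):
            if (r - center_row) ** 2 + (c - center_col) ** 2 > radius_sq:
                continue  # pixel lies outside the circular field of view
            if label == TUMOR:
                tumor += 1
            elif label == STROMA:
                stroma += 1
    if tumor + stroma == 0:
        raise ValueError("no tumor or stroma pixels inside the circle")
    return 100.0 * stroma / (tumor + stroma)


# 1.8 mm at 0.455 um/pixel is a little under 4000 pixels.
print(round(diameter_in_pixels(1.8, 0.455)))
```

Pixels of all other classes are ignored, which mirrors the exclusion of lumen, muscle and similar tissue during visual estimation.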

Fig. 1

Top row: Stroma hot-spot circles, 1.8 mm across, selected by the observers (in consensus) for the assessment of TSR-visual and extracted with a diameter of ~4000 pixels for processing by the CNN. Bottom row: The same regions with tissues segmented by the CNN for the calculation of TSR-auto. ‘Other tissue’ comprises the classes muscle, healthy epithelium, fatty tissue, lymphocytes, mucus and erythrocytes

2.5 Statistical analyses

In this study, TSR-visual and TSR-auto were compared as prognostic factors in rectal cancer. Statistical analyses were performed using IBM SPSS software v24.0 (Armonk, NY, USA). The intraclass correlation coefficient (ICC) was used to determine the correlation between TSR values assessed by the two observers and by the automated method. To investigate a possible relationship between clinicopathological variables and the numerical values of TSR-visual and TSR-auto, Mann–Whitney U and Kruskal–Wallis tests were performed for two- and multi-class variables, respectively. For further statistical analysis, TSR-visual and TSR-auto were dichotomized, subdividing patients into two groups: ‘stroma-low’ and ‘stroma-high’. Dichotomization of TSR-visual was performed based on a cut-off value previously established [10] on 63 colon cancer cases: stroma-high = TSR-visual > 50% and stroma-low = TSR-visual ≤ 50%. Because the optimal cut-off value for the automated approach has not yet been established, we analyzed results for two different cut-off values for TSR-auto. One method of dichotomization used the ‘50% stroma cut-off’, similar to TSR-visual, referred to as TSR-auto(50%); the other was based on the median of all measured TSR-auto values, referred to as TSR-auto(median), yielding nearly equal numbers of patients in the stroma-low and stroma-high groups.
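The two dichotomization rules can be sketched in a few lines. The function names and the example values are hypothetical; only the rule itself (stroma-high if TSR exceeds the cut-off, stroma-low otherwise) is taken from the text.

```python
from statistics import median


def dichotomize(tsr_values, cutoff):
    """Label each TSR value: stroma-high if strictly above the cut-off."""
    return ["stroma-high" if v > cutoff else "stroma-low" for v in tsr_values]


def dichotomize_at_median(tsr_values):
    """Use the cohort median as the cut-off; return (cutoff, labels)."""
    cutoff = median(tsr_values)
    return cutoff, dichotomize(tsr_values, cutoff)


# Made-up TSR percentages for five cases.
values = [30.0, 50.0, 55.0, 70.0, 90.0]
print(dichotomize(values, 50.0))
print(dichotomize_at_median(values))
```

With an odd number of cases the median case itself falls into the stroma-low group, so the two groups differ in size by one.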

Inter-observer agreements were calculated using Cohen’s Kappa (κ) on the dichotomized TSR values. Kaplan-Meier survival analyses were performed and log-rank statistics were used to test differences in both disease-specific survival (DSS) and disease-free survival (DFS) distributions. DSS was defined as the time between the date of surgery and the date of death attributable to rectal adenocarcinoma. For DFS, the date of the first event of cancer recurrence was used, which could be loco-regional or a distant metastasis. In case no event occurred, the time period until the last date of follow-up was used in the survival analyses. Finally, both uni- and multivariate analyses were performed for TSR-visual and TSR-auto using the Cox proportional hazards model. Probability values < 0.05 (2-sided) were considered statistically significant.
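Cohen's kappa corrects the observed agreement between two raters for the agreement expected by chance. The study computed it in SPSS; the function below is only a minimal sketch of the standard formula, with a hypothetical name and made-up labels.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two raters labelling the same cases."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("both raters must label the same non-empty case list")
    n = len(labels_a)
    # Observed agreement: fraction of cases with identical labels.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: product of the raters' marginal label frequencies.
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    categories = set(count_a) | set(count_b)
    expected = sum(count_a[c] * count_b[c] for c in categories) / (n * n)
    if expected == 1.0:
        raise ValueError("kappa is undefined when chance agreement is 1")
    return (observed - expected) / (1.0 - expected)


# Made-up example: the raters agree on three of four cases.
rater_1 = ["low", "low", "high", "high"]
rater_2 = ["low", "high", "high", "high"]
print(cohens_kappa(rater_1, rater_2))
```

A value of 1 indicates perfect agreement and a value of 0 indicates agreement no better than chance.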

3 Results

3.1 Clinicopathological data

Of 154 cases projected for inclusion in this study, twelve cases with mucinous carcinoma were excluded as these tumors exhibit largely different TSR values. Twelve other cases were excluded because, at the time of writing, the required slides or data were unavailable. One case was excluded because the corresponding tissue slide did not contain invasive carcinoma.

The median follow-up time for the remaining 129 patients used in the present study was 5.6 years (interquartile range 2.3–8.3). The median age of the patients at the time of surgery was 67 years (interquartile range 59–74). Further clinicopathological data can be found in Table 1. There was no significant correlation between the clinicopathological variables and assessed values of TSR-visual or TSR-auto (p > 0.05).

Table 1 Clinicopathological data for 129 rectal cancer patients in relation to TSR-visual and TSR-auto

3.2 Performance of the deep learning system

Measures of sensitivity and specificity per tissue type as well as overall accuracy were assessed for the automatic method by pixel-wise comparison of predicted labels with ground truth labels in manually annotated regions. We found that the overall accuracy was 94.6%, an improvement over the 93.8% reported by Ciompi et al. [20]. Values of per-class sensitivity and specificity are reported in Table 2.
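The pixel-wise evaluation described above can be sketched as a one-versus-rest computation per tissue class. The function name, the flat label lists and the example labels are assumptions made for illustration; they are not taken from the study.

```python
def pixel_metrics(true_labels, predicted_labels):
    """Overall accuracy plus per-class (sensitivity, specificity)."""
    if len(true_labels) != len(predicted_labels) or not true_labels:
        raise ValueError("label lists must be non-empty and of equal length")
    pairs = list(zip(true_labels, predicted_labels))
    n = len(pairs)
    accuracy = sum(t == p for t, p in pairs) / n
    per_class = {}
    for cls in sorted(set(true_labels) | set(predicted_labels)):
        # One-versus-rest counts for this class.
        tp = sum(t == cls and p == cls for t, p in pairs)
        fn = sum(t == cls and p != cls for t, p in pairs)
        fp = sum(t != cls and p == cls for t, p in pairs)
        tn = n - tp - fn - fp
        sensitivity = tp / (tp + fn) if tp + fn else None
        specificity = tn / (tn + fp) if tn + fp else None
        per_class[cls] = (sensitivity, specificity)
    return accuracy, per_class


# Made-up ground truth and predictions for five annotated pixels.
truth = ["tumor", "tumor", "stroma", "stroma", "muscle"]
pred = ["tumor", "stroma", "stroma", "stroma", "muscle"]
print(pixel_metrics(truth, pred))
```

In the example one tumor pixel is predicted as stroma, which lowers the sensitivity for tumor and the specificity for stroma while leaving muscle unaffected.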

Table 2 Quantitative performance of the CNN at pixel classification per tissue class

Examples of tissue segmentation by the CNN in four circular regions selected by the observers are shown in Fig. 1. In line with the high classification accuracy, good segmentation of tumor, stroma and other tissue types was observed. Further qualitative inspection of the circular regions revealed some minor segmentation errors. Directly at the stroma-tumor interface, a very thin band of stroma pixels was often misclassified as tumor. Conversely, small groups of tumor cells (e.g. tumor buds, or thin tumor structures) were sometimes misclassified as stroma.

3.3 Inter-observer and computer-observer agreement

The ICC between the two observers for the assessment of TSR was 0.736 (95% confidence interval (95% CI) 0.646–0.806). The co-occurrence of TSR scores assessed by the two observers is depicted in Fig. 2. The ICCs between TSR-auto and TSR-visual were 0.475 (95% CI 0.330–0.598) and 0.411 (95% CI 0.257–0.545) for observers 1 and 2, respectively.

Fig. 2

Scatter plot of assessed stroma percentages in 129 patients for Observer 1 and Observer 2. The co-occurrence of assessed percentages is indicated by circles with areas proportional to the number of patients scored with the corresponding TSR value. The dashed lines represent the boundary between stroma-low and stroma-high cases according to the cut-off value determined in Mesker et al. [10]. Green circles indicate cases where the observers agreed (105 in total) and red circles indicate disagreement (24 in total)

Moderate agreement between the two observers (κ = 0.578) was found after dichotomizing TSR-visual on the basis of the 50% cut-off, as described in section 2.5. Using the identical cut-off for TSR-auto, we observed only fair agreement between TSR-visual and TSR-auto (κ = 0.239). Agreement improved considerably (κ = 0.521) when the median was used as cut-off for TSR-auto, resulting in: stroma-low = TSR-auto ≤ 65.47% and stroma-high = TSR-auto > 65.47%. Patients assigned to stroma-low or stroma-high groups by the observers and the automatic method are detailed in Tables 3, 4 and 5.

Table 3 Cross-tabulation of Observer 1 versus Observer 2 after dichotomization
Table 4 Cross-tabulation of TSR-visual (consensus) versus TSR-auto(50%) after dichotomization
Table 5 Cross-tabulation of TSR-visual (consensus) versus TSR-auto(median) after dichotomization

3.4 Survival analyses

Survival analysis generally showed a worse outcome for stroma-high patients compared to stroma-low patients (Fig. 3), independent of the method of TSR assessment used (visual versus automated). For TSR-visual, the 5-year survival rates for stroma-low versus stroma-high cases were 71.0% versus 58.8% for DSS and 65.6% versus 49.1% for DFS. For TSR-auto(50%), the 5-year survival rates for stroma-low versus stroma-high cases were 86.6% versus 60.7% for DSS and 76.8% versus 54.9% for DFS. For TSR-auto(median), the 5-year survival rates for stroma-low versus stroma-high cases were 76.1% versus 58.4% for DSS and 70.0% versus 50.7% for DFS.
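Survival rates at a fixed time point such as 5 years are read off a Kaplan-Meier curve. The study obtained its curves in SPSS; the sketch below only illustrates the product-limit estimator on made-up follow-up data, and the function names are hypothetical.

```python
def kaplan_meier(times, events):
    """Product-limit estimate: list of (time, survival) at each event time.

    `events` holds 1 for an observed event and 0 for a censored case.
    """
    data = sorted(zip(times, events))
    at_risk = len(data)
    survival = 1.0
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        n_events = n_removed = 0
        # Group all cases that share this follow-up time.
        while i < len(data) and data[i][0] == t:
            n_events += data[i][1]
            n_removed += 1
            i += 1
        if n_events:
            survival *= 1.0 - n_events / at_risk
            curve.append((t, survival))
        at_risk -= n_removed
    return curve


def survival_at(curve, t):
    """Estimated survival probability at time t (step function)."""
    prob = 1.0
    for time, surv in curve:
        if time > t:
            break
        prob = surv
    return prob


# Made-up follow-up times in years; 1 = event, 0 = censored.
curve = kaplan_meier([1, 2, 3, 4, 5], [1, 0, 1, 0, 1])
print(survival_at(curve, 3.5))
```

Censored cases leave the risk set without lowering the curve, which is why the estimate differs from a simple proportion of surviving patients.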

Fig. 3

Kaplan-Meier curves for disease-specific survival (top row) and disease-free survival (bottom row) of stroma-low versus stroma-high patients. Results based on all patients (n = 129) are shown for TSR-visual (a, d) and TSR-auto(median) (b, e). Results for patients with stage II rectal cancer (n = 45) are shown for TSR-auto(median) only (c, f). Log-rank test p-values are shown in the graphs

For TSR-visual, a significantly lower DSS was seen in the stroma-high group compared to the stroma-low group (p = 0.042), but not for DFS (p = 0.182). Similarly, for TSR-auto(50%) this difference was significant for DSS (p = 0.018), but not for DFS (p = 0.066). For TSR-auto(median), both DSS and DFS were found to be significantly lower in the stroma-high group compared to the stroma-low group (p = 0.007 and p = 0.021, respectively). After stratification for TNM stage, stroma-high was also found to be associated with worse survival in stage II rectal cancer patients (n = 45), but this result was only significant for TSR-auto(median) (DSS p = 0.003 and DFS p = 0.015).

Hazard ratios (HR) and 95% CIs were determined for both DSS and DFS (Tables 6 and 7). In univariate analysis, all methods for TSR assessment were found to be prognostic for DSS: TSR-visual HR = 1.83 (95% CI 1.01–3.30); TSR-auto(50%) HR = 2.71 (95% CI 1.14–6.40); and TSR-auto(median) HR = 2.31 (95% CI 1.24–4.30). For DFS, only TSR-auto(median) was found to be prognostic with HR = 1.96 (95% CI 1.10–3.51). After stratification for TNM stage, only TSR-auto(median) was found to be prognostic for stage II rectal cancer patients, both for DSS (univariate HR = 4.13 (95% CI 1.53–11.16)) and DFS (univariate HR = 3.05 (95% CI 1.19–7.81)).

Table 6 Uni- and multivariate Cox regression analysis for disease-specific survival
Table 7 Uni- and multivariate Cox regression analysis for disease-free survival

In multivariate analysis, automated TSR assessment was found to be prognostic independent of age, gender, pT-stage, lymph node status, tumor grade, and whether adjuvant therapy was given, both for DSS: TSR-auto(50%) HR = 3.11 (95% CI 1.26–7.70) and TSR-auto(median) HR = 2.48 (95% CI 1.29–4.78); and for DFS: TSR-auto(50%) HR = 2.39 (95% CI 1.07–5.38) and TSR-auto(median) HR = 2.05 (95% CI 1.11–3.78). TSR-visual was not found to serve as an independent prognostic factor.

4 Discussion

For different cancer types, TSR has been shown to yield prognostic information. Visual assessment of TSR requires training, and may be difficult for cases close to the decision threshold of 50%. The present study shows that specifically for rectal adenocarcinoma the observer agreement is only moderate. Recent advances in slide scanning technology and machine learning have opened up new possibilities for computerized assessment of TSR. To the best of our knowledge, the present study shows for the first time that TSR can reliably be assessed by an automatic deep learning algorithm. The agreement of the automated system (using median cut-off) with the observer consensus (kappa = 0.521) was comparable to the inter-observer agreement (kappa = 0.578). The TSR assessed in this manner appeared to be a strong independent prognostic factor both for DSS and DFS in rectal adenocarcinoma. The prognostic value of the automated TSR was comparable to that assessed in consensus by two experienced observers for DSS in univariate analysis, but not in multivariate analysis. For DFS, only the automatically assessed TSR was significantly associated with outcome, both in univariate and multivariate analysis.

Interestingly, automated TSR (using the median as cut-off) showed prognostic value for TNM stage II patients. Clinically, this is a subgroup of patients for which post-operative treatment is still under debate and more research is needed [22, 23]. TSR can potentially help to direct this discussion and add information for a more personalized treatment of this patient category.

In a recent study, Scheer et al. [8] analyzed TSR on the same cohort of patients as used in the present study. However, rather than a hot-spot measure, the authors applied a scoring procedure in which an average TSR was assessed based on the entire tumor area in a slide. Also, they defined TSR as the carcinoma percentage (CP) and the estimated percentages were grouped using three categories (low-CP, intermediate-CP and high-CP). In univariate survival analysis, CP was found to be prognostic for DSS and DFS. With CP-high as baseline and after correction for age, grading, pathological T-stage, and adjuvant treatment, CP-intermediate was found to be correlated with worse DSS and DFS; however, this result was obtained only in the subset of lymph node metastasis negative cases (n = 94). In the present study, the prognostic value of TSR remained intact for the entire cohort of patients after correction for clinicopathological variables, including lymph node status. The most probable cause for this difference is the TSR scoring method. In the present study we decided to follow a more widely accepted scoring system, which appears to outperform methods where the overall tumor area is scored by averaging.

The results of our observer study indicate that TSR obtained by visual estimation serves as a prognostic factor of DSS (although not reaching statistical significance when correcting for other clinicopathological features), but not of DFS. Furthermore, only moderate agreement was found between observers. These results are in contrast with previous studies [9, 10, 13] on TSR assessment in colon cancer. This discrepancy may be explained by the fact that, compared with the colon, the rectal wall has a thicker muscle layer and, in some cases, it may be difficult to distinguish between stromal tissue and smooth muscle cells, especially with darker H&E-stained slides. Muscle tissue, which should be excluded from scoring, may therefore be interpreted as stromal tissue by one observer and not by the other. Furthermore, as shown in Fig. 2, most discrepancies (15/24 cases) are found around the cut-off point of 50%. Especially for these cases, computer-aided TSR assessment may be very useful.

For the automated method two different stroma cut-off values have been investigated in this study: the value used for the visual estimation (50%), and the median of measured TSR-auto values. We found comparable results for the two cut-offs, with a slightly higher hazard ratio for the 50% cut-off at the cost of a wider 95% confidence interval. However, since in general automated assessment of TSR yields higher stroma percentages than visual assessment, the use of a 50% cut-off for TSR-auto corresponded much less to TSR-visual compared to the use of the median cut-off (as is reflected in the kappa values). The optimal cut-off value for TSR-auto should be further investigated and validated in an independent cohort.

It is worth noting that one of the patient inclusion criteria for the cohort that was used in this study was the absence of neoadjuvant treatment. The reason for this design choice, originally made by Scheer et al. [8], was that both chemotherapy and radiotherapy modify the tissue architecture and, as such, may hamper the assessment of TSR or its prognostic value. The proposed method can, therefore, aid clinicians in selecting the right treatment options for rectal cancer patients who did not receive preoperative (chemo)radiotherapy. Furthermore, given the fact that the colon and the rectum are parts of the same continuous organ and have a similar histological appearance, the presented deep learning algorithm has the potential to be successfully applied to the analysis of colon cancer as well.

The deep learning-based approach proposed in this work needs the position of a user-provided stroma hot-spot as input in order to assess TSR. After this manual input is provided, the proposed method can process the hot-spot area in the whole-slide image automatically. As such, human input is still required, making the method only semi-automatic. It is worth noting that in Ciompi et al. [20] a computer model similar to the one used in this work has shown a high performance at segmenting several tissue types in rectal cancer at the whole-slide image level, i.e., beyond the limited area of the selected hot-spot. As a consequence, this method has the potential to be used to assess TSR both at whole-tumor level and at whole-slide image level. Such an approach would overcome the need for a user-provided stroma hot-spot and, therefore, allow investigating TSR at very large scale via fully-automatic computation. Future work will be directed towards further automation of TSR assessment and validation in a large independent cohort.

Although, to the best of our knowledge, TSR assessment (visual or automated) has not yet been implemented in routine pathology diagnostics, it was recently reported [24] that the TNM Evaluation Committee (UICC) and the College of American Pathologists (CAP) have discussed TSR and acknowledged its potential for integration with the TNM staging system. To achieve this for colon cancers, we are currently investigating the reproducibility of (visual) TSR assessment in a large European multicenter study [25]. The results of the present study suggest that automated TSR can potentially be of significant aid to pathologists in routine diagnostics. However, validation of the proposed technology on a larger and independent data set is essential and is, therefore, among our future research goals. The objectivity of a deep learning-based method, which allows accurate and reproducible quantification of TSR, has the potential to pave the way to implementation of TSR in clinical practice.