Deep learning for automatic Gleason pattern classification for grade group determination of prostate biopsies

Histopathologic grading of prostate cancer using Gleason patterns (GPs) is subject to a large inter-observer variability, which may result in suboptimal treatment of patients. With the introduction of digitization and whole-slide images of prostate biopsies, computer-aided grading becomes feasible. Computer-aided grading has the potential to improve histopathological grading and treatment selection for prostate cancer. We aimed to automate the detection of GPs and the determination of the grade groups (GGs) using a convolutional neural network. In total, 96 prostate biopsies from 38 patients were annotated at the pixel level. Automated detection of GP 3 and GP ≥ 4 in digitized prostate biopsies was performed by re-training the Inception-v3 convolutional neural network (CNN). The outcome of the CNN was subsequently converted into probability maps of GP ≥ 3 and GP ≥ 4, and the GG of the whole biopsy was obtained according to these probability maps. Differentiation between non-atypical and malignant (GP ≥ 3) areas resulted in an accuracy of 92% with a sensitivity and specificity of 90 and 93%, respectively. The differentiation between GP ≥ 4 and GP ≤ 3 had an accuracy of 90%, with a sensitivity and specificity of 77 and 94%, respectively. Concordance of our automated GG determination method with a genitourinary pathologist was obtained in 65% (κ = 0.70), indicating substantial agreement. A CNN allows for accurate differentiation between non-atypical and malignant areas as defined by GPs, leading to a substantial agreement with the pathologist in defining the GG. Electronic supplementary material The online version of this article (10.1007/s00428-019-02577-x) contains supplementary material, which is available to authorized users.


Introduction
Prostate cancer is the second-most diagnosed cancer among men, accounting for approximately 25% of cancer cases in the western world [1]. It has been suggested that these high incidence rates are caused by widespread prostate-specific antigen (PSA) screening and subsequent biopsy harvesting [2].
Pathological grading of prostate cancer was originally based on the sum of the two most common Gleason patterns (GPs), called the Gleason score (GS) [3]. The initial Gleason grading system defines five histological patterns, with a focus on atypical glandular structures. GP 1 represents well-differentiated carcinoma, whereas GP 5 is defined as the least-differentiated carcinoma with complete loss of glandular structures. The intermediate scores are based on a linear scaling between the two extremes. Updates of the ISUP guidelines discouraged the assignment of Gleason scores 2-4. This was due to the poor reproducibility, the poor correlation with radical prostatectomy grade, and the risk of misleading clinicians and patients into believing that there was an indolent tumor [4]. Better correlation with clinical outcome was achieved by the introduction of the modified Gleason score (GS), in which the most frequently found GP and the highest GP are summed.

Marit Lucas, Ilaria Jansen, Daniel M. de Bruin and Henk A. Marquering contributed equally to this work.
This article is part of the Topical Collection on Quality in Pathology

The recently proposed grade groups (GGs) [5,6] are aimed to more accurately predict the prognosis of patients. Even though the GG classification results in prognostic distinct grade groups [7], similar inter-observer variability rates to conventional Gleason scoring have been reported [8,9]. Therefore, to avoid suboptimal treatments [5,10,11], an accurate and reproducible method to stratify the tumors is needed.
With the introduction of whole slide image (WSI) scanners, the digitization of slides has opened up the opportunity for computer-aided diagnosis (CAD), which has the potential to aid the pathologist and reduce inter-observer variability [12,13]. Several studies have presented automated differentiation of GPs [12,14,15,16]. Convolutional neural networks (CNNs), a deep learning approach particularly useful for the classification of images, nowadays allow the computer to automatically find the best set of image-based features. These features are able to distinguish between the predefined classes [17] without depending on extensive pre-processing or human knowledge. Litjens et al. [18] were able to automatically differentiate between tumorous and non-tumorous prostate biopsies using a CNN. Ing et al. [19] used semantic segmentation for the grading of whole-mount radical prostatectomy sections. Källén et al. [20] differentiated between GP 3 and GP 5, yielding an accuracy of 81% in homogeneous GP regions of interest within a biopsy. In this study, we propose an approach in which we include the extent of GP 3 and GP 4 patterns in heterogeneous biopsies for a whole slide GG classification.

Patient selection
The Institutional Review Board of the Amsterdam University Medical Centers (UMC), location AMC, Amsterdam (W18_056 # 18.074) granted approval for this study. Hematoxylin and eosin (H&E) tissue sections were retrieved from the archives of the department of Pathology of the Amsterdam UMC, location AMC. The sections originated from patients that underwent a diagnostic biopsy between 2015 and 2017 (n = 38). The H&E-stained 4-μm-thick sections were digitized using a Philips UltraFast scanner (Philips Digital Pathology Solutions, Best, the Netherlands) and the WSIs were exported at 20× magnification, resulting in a pixel resolution of 0.5 μm. A total of 96 tissue sections were included, which can contain multiple biopsies or biopsy fragments, derived from 38 patients, with a median of two tissue blocks per patient and an interquartile range of 1 to 4.

Reference standard: manual annotations
The digitized slides were manually annotated by one of two expert observers (I.J., KK.d.L.) and subsequently checked by a genitourinary pathologist (CD.S-H.) using an in-house developed free-hand annotation tool [21] (see Fig. 1). The first class was the unaffected stroma (connective tissue) of the prostate and was assigned to all pixels that were not in the proximity of other annotations. The non-atypical glands, including both healthy glands and glands with low-grade prostatic intraepithelial neoplasia (LGPIN), were defined as the second class. The third class was GP 3 and the fourth class consisted of GP ≥ 4 with the affected stroma. As the incidence of GP 5 was very low in this dataset, GP 4 and GP 5 were merged to balance the classes. Subsequently, the GG for each biopsy was determined based on the surface area of GP 3 and GP ≥ 4 in that biopsy. As no differentiation was made between GP 4 and GP 5, a slightly adjusted grouping was applied (see Table 1). When it was unclear whether a gland was benign or malignant, the slides that were immunohistochemically stained with p63-AMACR or 34betaE12 were retrieved from the archive, if available, and inspected. Regions in which grading was impossible due to out-of-focus areas, tissue folds, or excessive ink, and regions in which the immunohistochemical staining was inconclusive or unavailable, were excluded from the study. Moreover, all regions with high-grade prostatic intraepithelial neoplasia were excluded from the study.

Convolutional neural network
Training a CNN requires large datasets with considerable variation, which are often not available in medical diagnostics.
Here, we present a study exploiting the large number of pixels and the substantial variation in tumor glands in each WSI to train a CNN [18]. By training a CNN on detailed GP annotations, we aim to accurately differentiate GPs and GGs in heterogeneous prostate biopsies.

Patch generation
Patches were extracted from the annotated RGB images. The patch size was required to be 299 × 299 pixels, which corresponds to an area of approximately 150 × 150 μm². Patches (with possible overlap) were randomly extracted from the image using MATLAB R2015b (MathWorks, Natick, MA, USA). The central pixel of the patch defined the class. As CNNs require huge datasets for training, data augmentation was applied. Rotation by 90, 180, and 270°, as well as horizontal and vertical mirroring, was applied to all patches.
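The extraction and augmentation steps described above can be illustrated with a minimal Python/NumPy sketch. It is not the MATLAB code used in the study. The function names, the synthetic image, and the synthetic label map are hypothetical, and the sketch assumes that each rotation and each mirroring is applied separately to the original patch (the text does not state whether the transforms were combined).

```python
import numpy as np

PATCH_SIZE = 299  # patch edge length in pixels, as stated in the text


def extract_patch(image, label_map, row, col, size=PATCH_SIZE):
    """Cut a size x size patch centred on (row, col); the centre pixel sets the class."""
    half = size // 2
    patch = image[row - half:row - half + size, col - half:col - half + size]
    return patch, int(label_map[row, col])


def augment(patch):
    """Return the patch, its 90/180/270 degree rotations, and both mirror images."""
    rotations = [np.rot90(patch, k) for k in range(4)]  # k = 0 is the original
    mirrors = [np.fliplr(patch), np.flipud(patch)]      # horizontal, vertical
    return rotations + mirrors


# Synthetic stand-in data (hypothetical): a random RGB image and a 4-class label map.
rng = np.random.default_rng(0)
image = rng.integers(0, 256, size=(600, 600, 3), dtype=np.uint8)
label_map = rng.integers(0, 4, size=(600, 600))

# Draw a random centre far enough from the border for a full patch.
half = PATCH_SIZE // 2
row, col = (int(v) for v in rng.integers(half, 600 - half, size=2))
patch, label = extract_patch(image, label_map, row, col)
augmented = augment(patch)
```

Under these assumptions, each extracted patch yields six training images: the original, three rotations, and two mirror images.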
Based on the number of extracted patches of each class, the patches were grouped in four balanced partitions. In these balanced partitions, a biopsy could only be present in one partition. Within these partitions, the number of patches in each class was reduced to equal the smallest class in all partitions.
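The class-balancing step can be sketched as random down-sampling of every class to the size of the smallest one. The following Python example is an illustration only: the function name, the class counts, and the use of random selection are assumptions, and it balances a single partition, whereas the study equalized the class size across all partitions.

```python
import numpy as np


def balance_classes(patch_labels, rng):
    """Return sorted patch indices that keep an equal number of patches per class."""
    patch_labels = np.asarray(patch_labels)
    classes, counts = np.unique(patch_labels, return_counts=True)
    n_keep = counts.min()  # size of the smallest class
    keep = [
        rng.choice(np.flatnonzero(patch_labels == c), size=n_keep, replace=False)
        for c in classes
    ]
    return np.sort(np.concatenate(keep))


# Hypothetical class counts for the four tissue classes in one partition.
rng = np.random.default_rng(0)
labels = np.array([0] * 50 + [1] * 30 + [2] * 12 + [3] * 20)
kept = balance_classes(labels, rng)
balanced_counts = np.bincount(labels[kept], minlength=4)
```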

CNN architecture
The CNN was trained on three of these balanced partitions (which added up to approximately 268,000 patches) that were designated as the training set, while the fourth partition was designated as the test set and was used for cross-validation (with approximately 89,000 patches). This procedure allowed us to study the performance of the CNN four times. The CNN (Inception v3 architecture) [22] was retrained using CNTK, which is an open source deep learning toolkit for image recognition [23]. This CNN is composed of various layers of Inception modules and two classifying layers [22]. The CNN outputs the probability of a patch belonging to each of the four tissue classes. Specifications of the network can be found in Table S1.
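The partitioning and four-fold rotation can be sketched as follows. The text does not specify how biopsies were distributed over the partitions, so the greedy assignment below is an assumption. It also balances only the total patch count per partition, not the count per class, and the biopsy identifiers and patch counts are hypothetical.

```python
def assign_partitions(patch_counts, n_partitions=4):
    """Greedy split: place the largest biopsy into the currently smallest partition.

    Each biopsy ends up in exactly one partition.
    """
    totals = [0] * n_partitions
    assignment = {}
    ordered = sorted(patch_counts.items(), key=lambda item: item[1], reverse=True)
    for biopsy, count in ordered:
        target = totals.index(min(totals))
        assignment[biopsy] = target
        totals[target] += count
    return assignment, totals


def cross_validation_folds(n_partitions=4):
    """Return (training partitions, test partition) for each of the rotations."""
    return [
        ([p for p in range(n_partitions) if p != test], test)
        for test in range(n_partitions)
    ]


# Hypothetical number of patches per biopsy.
patch_counts = {"b1": 900, "b2": 700, "b3": 650, "b4": 600,
                "b5": 300, "b6": 250, "b7": 200, "b8": 100}
assignment, totals = assign_partitions(patch_counts)
folds = cross_validation_folds()
```

Each of the four folds trains on three partitions and tests on the remaining one, so every biopsy is used for testing exactly once.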

Post-processing
The probabilities provided by the CNN were used to differentiate between non-atypical tissue (non-atypical gland patches with unaffected stroma patches), GP 3, and GP ≥ 4 by using a cross-validated support vector machine. Next, by assigning the probability of each patch belonging to one of these three classes, probability maps were generated. Each patch of the test set was assigned to the class with the highest probability.
The percentages of randomly selected patches of a biopsy that were classified as GP 3 and GP ≥ 4 were used to classify the slides according to the adjusted GGs. Using Table 1, the majority and minority of the automatically classified GP patches were summed (e.g., GP ≥ 4 + GP 3 = adjusted GG 3). If only one GP was present, this GP was doubled (e.g., GP 3 + GP 3 = adjusted GG 1). At least 4.5% of the patches needed to be positively identified for each class (GP 3 and GP ≥ 4) to reduce the influence of noise on the adjusted GG determination.
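The rule above can be written as a small decision function. The sketch below is an illustration under stated assumptions, not the implementation used in the study. Only the two examples given in the text (GP 3 + GP 3 = adjusted GG 1 and GP ≥ 4 + GP 3 = adjusted GG 3) and the 4.5% threshold come from the source. The values returned for GP 3 + GP ≥ 4 (2) and GP ≥ 4 + GP ≥ 4 (4) assume that Table 1 follows the conventional grade group order, the label 0 for biopsies without detected malignancy is hypothetical, and the handling of ties is arbitrary.

```python
def adjusted_grade_group(frac_gp3, frac_gp4plus, threshold=0.045):
    """Map the fractions of GP 3 and GP >= 4 patches to an adjusted grade group.

    A pattern counts as present only if at least `threshold` of the patches
    were assigned to it (4.5% in the text).
    """
    has_gp3 = frac_gp3 >= threshold
    has_gp4plus = frac_gp4plus >= threshold
    if not has_gp3 and not has_gp4plus:
        return 0  # hypothetical label: no malignant pattern above threshold
    if has_gp3 and not has_gp4plus:
        return 1  # GP 3 + GP 3 (from the text)
    if has_gp4plus and not has_gp3:
        return 4  # GP >= 4 + GP >= 4 (assumed mapping)
    if frac_gp3 >= frac_gp4plus:
        return 2  # majority GP 3, minority GP >= 4 (assumed mapping)
    return 3      # majority GP >= 4, minority GP 3 (from the text)
```

For example, a biopsy with 10% GP 3 patches and 30% GP ≥ 4 patches would be assigned adjusted GG 3, whereas 2% GP ≥ 4 patches alongside 20% GP 3 patches would be treated as noise and give adjusted GG 1.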
Post hoc visual evaluation of the probability maps was performed to identify possible causes of false positive regions of the methodology.

Accuracy analysis
The three assigned classes were represented in a confusion matrix for comparison with the manually depicted class. The confusion matrix was subsequently dichotomized to calculate the sensitivity, specificity, and accuracy. The F-measure was used as an accuracy measure. The F-measure ($F_1$) considers both precision and recall and is defined as $F_1 = 2 \cdot (\text{precision} \times \text{recall}) / (\text{precision} + \text{recall})$. The patches were dichotomized between non-atypical and malignant tissue (GP ≥ 3). Subsequently, we also assessed the accuracy for differentiating GP ≤ 3 from GP ≥ 4. This differentiation has been used as a measure to determine the need of treatment [24].
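The dichotomization and the derived measures can be illustrated with a short Python sketch. The confusion matrix below contains made-up counts and is not Table 2. The sketch assumes that rows hold the reference class and columns the predicted class, in the order non-atypical, GP 3, GP ≥ 4. The function names are hypothetical.

```python
import numpy as np


def dichotomize(confusion, first_positive):
    """Collapse an ordered multi-class confusion matrix into TP, FN, FP, TN.

    Classes with index >= first_positive are treated as positive.
    """
    cm = np.asarray(confusion)
    tp = cm[first_positive:, first_positive:].sum()
    fn = cm[first_positive:, :first_positive].sum()
    fp = cm[:first_positive, first_positive:].sum()
    tn = cm[:first_positive, :first_positive].sum()
    return tp, fn, fp, tn


def binary_metrics(tp, fn, fp, tn):
    """Sensitivity (recall), specificity, accuracy, and F1 from binary counts."""
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    accuracy = (tp + tn) / (tp + fn + fp + tn)
    precision = tp / (tp + fp)
    f1 = 2 * precision * sensitivity / (precision + sensitivity)
    return {"sensitivity": sensitivity, "specificity": specificity,
            "accuracy": accuracy, "f1": f1}


# Made-up counts; rows = reference, columns = predicted.
confusion = [[90, 6, 4],
             [10, 70, 20],
             [5, 15, 80]]

malignant = binary_metrics(*dichotomize(confusion, first_positive=1))   # GP >= 3
high_grade = binary_metrics(*dichotomize(confusion, first_positive=2))  # GP >= 4
```

Setting `first_positive=1` gives the non-atypical versus malignant split, and `first_positive=2` gives the GP ≤ 3 versus GP ≥ 4 split.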
The kappa-statistic (κ) was used to calculate the concordance between the GG classifications and the reference standard. A quadratic weighted kappa was used in which disagreement on the ordinal GG scale was not assumed to be equally important [25].
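For reference, a quadratic weighted kappa can be computed as in the sketch below. It uses the standard definition with disagreement weights $(i - j)^2 / (k - 1)^2$ for $k$ ordered categories. The function name and the zero-based integer coding of the grade groups are assumptions, and the sketch does not reproduce the statistical software used in the study.

```python
import numpy as np


def quadratic_weighted_kappa(rater_a, rater_b, n_categories):
    """Cohen's kappa with quadratic disagreement weights for ordinal ratings.

    Ratings are integers in the range 0 .. n_categories - 1.
    """
    observed = np.zeros((n_categories, n_categories))
    for i, j in zip(rater_a, rater_b):
        observed[i, j] += 1
    observed /= observed.sum()

    # Agreement expected by chance, from the marginal distributions.
    expected = np.outer(observed.sum(axis=1), observed.sum(axis=0))

    idx = np.arange(n_categories)
    weights = (idx[:, None] - idx[None, :]) ** 2 / (n_categories - 1) ** 2

    return 1.0 - (weights * observed).sum() / (weights * expected).sum()


# Toy ratings on a four-level ordinal scale (made-up values).
perfect = quadratic_weighted_kappa([0, 1, 2, 3], [0, 1, 2, 3], 4)
reversed_order = quadratic_weighted_kappa([0, 1, 2, 3], [3, 2, 1, 0], 4)
```

Because the weights grow with the squared distance between categories, confusing adjacent grade groups is penalized less than confusing distant ones.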

Results
Training of each CNN took approximately 175 h. Because only minor differences existed between the performance of the four trained networks, only the results of one representative trained network are illustrated. The proportion of correctly classified patches was 93% for non-atypical patches, 73% for GP 3, and 77% for GP ≥ 4. The confusion matrix of the classifications is presented in Table 2.
The differentiation between non-atypical and malignant (GP ≥ 3) areas had an accuracy of 92%, with a sensitivity and specificity of 90 and 93%, respectively. The F-measure was 0.93.
The differentiation between GP ≥ 4 and GP ≤ 3 had an accuracy of 90%, with a sensitivity and specificity of 77 and 94%, respectively, and an F-measure of 0.81.
An example of a probability map for malignant tissue is shown in Fig. 2 and a probability map for GP ≥ 4 is shown in Fig. 3. Visual inspection of the probability maps revealed false-positive regions at tissue folds and at regions that were either out-of-focus or obscured by the presence of ink. Another major contributor to false-positive regions was the border of the biopsies. In these regions, incomplete glands as well as cutting artifacts are mostly present.
Concordance of the full biopsy-based adjusted GG classification was obtained in 65% (N = 40) of the biopsies, resulting in a κ of 0.70 (see Table 3), indicating substantial agreement.

Discussion
We have demonstrated that a CNN can accurately differentiate GP 3 and GP ≥ 4 from non-atypical tissue in prostate biopsies. Probability maps of GPs showed good visual agreement, suggesting that CNNs can be a valuable tool for computer-aided diagnosis. Determination of the adjusted GG based on the presence of the GPs showed substantial agreement with the reference standard.

Comparison with current literature
The agreement between human and automated GG classification is in line with the inter-observer agreement between two general pathologists described by Ozkan et al. [8]. This conformity indicates that the GG classification agreement cannot be further improved, since higher concordance with one observer would result in a lower concordance with another. Other automated methods for GP classification are mainly based on the automated detection of glands followed by the extraction of hand-crafted features, such as the gland and lumen surface area, for classification [12,14,15]. Consequently, some regions of GP 4 can be missed by these automatic detection methods, as glandular structures are largely reduced and affected here [15]. CNNs have proven to be useful for classification of prostate biopsies. Litjens et al. [18] automatically differentiated between tumorous and nontumorous prostate biopsies. Källén et al. [20] made further efforts to differentiate between GP 3 to GP 5. In their study, they only used homogeneous single-class regions of interest and classified whole biopsies based on the most prominent GP within the slide. The patch-based performance of Källén et al. is also exceeded by our study [20]. In particular, the accuracy of the differentiation between GP 3 and GP 4 patches is higher in this study, and differentiation between GP 3 and GP 4 has the biggest clinical implications. Differentiation between GP 3 and GP 4 is very often problematic [26,27]. For instance, fused or small glands without lumina can be categorized as either GP 3 or GP 4 [26]. The training using detailed annotations in this study might

Future perspective
Automated diagnosis has the potential to reduce both the workload of and variability between pathologists [12,13].
CNNs have already outperformed pathologists working under a time constraint in the detection of breast cancer metastasis in the lymph nodes [28].
Although the patch-based classification can be considered accurate, classification results may be improved by the introduction of extensive post-processing. By taking into account more information of the neighborhood, conditional random fields (CRFs), among others, have the potential to improve the label assignment [17].
For the generalizability of future CNNs, the dataset should be annotated by multiple genitourinary pathologists in order to decrease the influence of inter-observer variability. In this dataset, special attention should be paid to including more patients with a (heterogeneous) GP 5. This would allow the system to classify according to the official GG instead of the adjusted GG. Moreover, biopsies from multiple institutions should be incorporated because the appearance of biopsies differs between institutions, among other reasons due to different staining protocols. The benefits of including more biopsies from different hospitals are twofold. First, it makes the applied methodology more robust against differences in the appearance of the biopsy, and second, it improves the performance of the CNN.

Limitations
This study suffered from a number of limitations. The differentiation between GP 3 and GP 4 was based on the annotations made by two trained observers and one expert genitourinary pathologist, although it is known from the literature that a large degree of variation may exist between the diagnoses of individual pathologists. The same holds for the annotation of LGPIN. However, we assume that the precise annotation of the glands on high-resolution images, as well as the two-stage delineation process, resulted in a reliable dataset.
As only data of 96 tissue sections from 38 different patients were included, we partitioned the data based on biopsies rather than on patients. This approach may have resulted in an overestimation of the accuracy, as patient-specific patterns can be present in both training and testing partitions. However, we found no indication of overfitting to patient-specific patterns. Visual inspection of biopsies from patients present in only a single partition showed performance similar to that for patients present in multiple partitions. To improve the performance of the CNN, false-positive regions caused by tissue folds, out-of-focus areas, borders of the biopsy, and ink should be automatically excluded, as these regions may distract attention from real findings. Nonetheless, the automated adjusted GG determination displayed an agreement comparable to the inter-observer agreement reported by Ozkan et al. [8], as the majority of the patients in the test set are in GG 1. Differentiation between adjusted GG 2 and adjusted GG 3 is still challenging, while this differentiation has the largest clinical implications for patients. Unfortunately, due to the low presence of GP 5, this study introduced the adjusted GG. Therefore, the proposed methodology is aimed at the localization and differentiation of GPs in whole needle biopsies. This can help pathologists in the detection of GPs in prostate biopsies and suggest the GPs present.

Conclusions
We demonstrate the feasibility of training a CNN to differentiate between GPs in heterogeneous biopsies. Good differentiation between non-atypical tissue and tumorous tissue is achieved, as well as a substantial agreement in GG classification between the automated method and the specialized genitourinary pathologist.