1 Introduction

Statistical learning approaches, primarily those embodied by deep learning, have demonstrated the potential for advancing our ability to extract information from histology images. The concept of end-to-end learning has been applied to predict cancer grade [1], genotype [2], and outcome [3] directly from digitised haematoxylin and eosin (H&E) images. Rather than summarising this vast amount of information as a single number or category, we aim to capture potentially diagnostically relevant information and to support a more objective decision-making process. Providing a dense segmentation of the entire image is a challenging and important first step towards this goal.

Tissue architecture is characterised by an organ-specific hierarchical assembly of various components (e.g. stroma, epithelium, glands, blood vessels), their shape, and their topology. Progressing disease can severely disrupt this multi-scale organisation. The examples shown in Fig. 1 illustrate how an increased amount of visual context improves the likelihood of correct identification. Classical medical imaging and computer vision research provides numerous examples of how information from multiple scales can be utilised. More recently, various deep learning approaches [4] have been introduced that effectively learn visual context directly from training data. With this paper we provide a more systematic comparison of these approaches and study how they affect the ability to differentiate between different tissue components. In addition, we introduce a computational model that shares features across scales and learns dependencies between scales using a long short-term memory (LSTM) unit [5].

An openly available collection of breast cancer samples [6] and a local collection of prostate cancer histology provide the necessary disease context. An overview of the relevant deep learning approaches is provided in Sect. 2. The set of architectures used for comparison and the details of the datasets used in this study are presented in Sect. 3. Our results in Sect. 4 give a strong indication that modelling visual context impacts the quality of dense segmentation of histology images. While these results are extremely encouraging, we need to take shortcomings of the datasets into account. In our conclusions we outline the future studies necessary to overcome the bias present in the current datasets.

Fig. 1. Visual context. The different images of the scene containing a jumping cat effectively highlight that the correct interpretation of a scene depends on visual context. We contend that the accuracy of dense segmentation of histology images into different tissue types depends on our ability to make effective use of multiple scales.

2 Related Work

Two main approaches to medical image segmentation are semantic segmentation and patch-wise classification. For example, Ronneberger et al. [7] incorporate a dense prediction step in their U-Net convolutional neural network (CNN) architecture, which has been applied with great success to a range of biomedical applications. Zhang et al. [8] use a patch-based CNN approach to segment regions of infant MR brain images. In whole slide histology image segmentation, patch-based prediction models appear to dominate the landscape, since the lack of comprehensive annotated ground truth prohibits the use of semantic segmentation approaches. Patch-based approaches have proven successful in various applications [9, 10]. To detect breast cancer metastases in lymph nodes at a fine-grained level, Wang and colleagues [9] divide large whole slide images into small patches and employ a CNN to assign a prediction score to every patch. The final decision is aggregated from these micro predictions. Nonetheless, processing each patch independently takes neither contextual information nor long-range spatial dependencies into account. To address this shortcoming, Moeskops et al. [11] extract patches of different sizes centred at the same pixel location. Each patch is processed by a separate branch of a CNN, yielding multi-scale features which are then combined for the final prediction. Instead of extracting multiple patches at different scales, Kong et al. [12] use a CNN with a 2-dimensional long short-term memory (LSTM) architecture [4] to learn spatial dependencies between image patches and their neighbours. Incorporating multi-scale and contextual information into a patch-wise classification scheme is still an open problem. A systematic comparison of different network architectures is necessary to establish how visual context should be utilised in whole slide image segmentation.

Fig. 2. Architectures used in this study. Model complexity and run time are specified in Table 1.

3 Methods

Comparative Methods. The 10 different architectures used in this study are presented in Fig. 2. These can be categorised into three groups: (1) those that operate at a single image resolution (A, B, C, D, and H), (2) those that fuse information from multiple resolutions before passing it through a neural network (also known as early fusion; E), and (3) those that combine multi-scale output features from the networks before prediction (late fusion; F, G, I, and J). Two different approaches to late fusion are considered. Architectures G and J apply an LSTM unit to integrate the multi-scale information, while methods F and I fuse the features directly. By using the same CNN for all scales (F and G) we test whether it is beneficial to share features; in contrast, methods I and J learn a separate CNN for each spatial scale.

With the exception of model H, the same CNN architecture is used to compare how the different approaches utilise multi-scale context. It consists of 4 layers, where each layer starts with a \(4\times 4\) convolution, followed by batch normalisation and a rectified linear unit (ReLU) activation before downsampling. Method H instead uses 8 of these convolutional layers. After these layers, the feature responses are fed to 2 fully connected layers, each with 512 hidden units. Dropout is employed after each fully connected layer. A linear classifier is used in the final layer of the network. The spatial dimension of the input images is \(64\times 64\) for all methods, except method H, which uses large high-resolution images (\(512\times 512\)) as input. The LSTM layer has a hidden state of size 512. See Supplementary Material for the detailed definition of each model.
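
A minimal PyTorch sketch of how the shared base CNN and the LSTM fusion of architecture G could be assembled is shown below. The layer counts, kernel size, input size, and hidden sizes follow the description above; the channel widths, the use of max-pooling as the downsampling step, and the dropout rate are assumptions of ours, not taken from the Supplementary Material.

```python
import torch
import torch.nn as nn

class BaseCNN(nn.Module):
    """Four (conv 4x4 -> batch norm -> ReLU -> 2x2 max-pool) blocks,
    followed by two 512-unit fully connected layers with dropout."""
    def __init__(self, in_channels=3, widths=(32, 64, 128, 256), feat_dim=512):
        super().__init__()
        blocks, c = [], in_channels
        for w in widths:                      # channel widths are assumed
            blocks += [nn.Conv2d(c, w, kernel_size=4, padding=1),
                       nn.BatchNorm2d(w),
                       nn.ReLU(inplace=True),
                       nn.MaxPool2d(2)]       # the downsampling step
            c = w
        self.features = nn.Sequential(*blocks, nn.Flatten())
        self.fc = nn.Sequential(
            nn.LazyLinear(feat_dim), nn.ReLU(inplace=True), nn.Dropout(0.5),
            nn.Linear(feat_dim, feat_dim), nn.ReLU(inplace=True), nn.Dropout(0.5))

    def forward(self, x):                     # x: (B, 3, 64, 64)
        return self.fc(self.features(x))      # -> (B, 512)

class ModelG(nn.Module):
    """One CNN shared across scales; an LSTM integrates per-scale features."""
    def __init__(self, n_classes=4, feat_dim=512, hidden=512):
        super().__init__()
        self.cnn = BaseCNN(feat_dim=feat_dim)  # shared weights for all scales
        self.lstm = nn.LSTM(feat_dim, hidden, batch_first=True)
        self.classifier = nn.Linear(hidden, n_classes)

    def forward(self, patches):               # patches: (B, n_scales, 3, 64, 64)
        b, s = patches.shape[:2]
        feats = self.cnn(patches.flatten(0, 1)).view(b, s, -1)
        out, _ = self.lstm(feats)              # sequence runs over the scales
        return self.classifier(out[:, -1])     # logits from the final step
```

Methods I and J would instead instantiate one `BaseCNN` per scale, and method F would replace the LSTM by direct fusion of the per-scale features.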

The data are separated at the patient level into 56% for training, 14% for validation, and 30% for testing. The methods are trained using the ADAM optimiser [13] with an initial learning rate of 0.0002. Training is stopped once the validation loss no longer improves; otherwise, it is aborted after 100 epochs.
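
The following sketch mirrors this protocol: ADAM with a learning rate of 0.0002, early stopping on the validation loss, and a hard cap of 100 epochs. The patience value and the data loaders are assumptions, not specified in the text.

```python
import copy
import torch

def train(model, train_loader, val_loader, device="cuda", patience=10):
    model = model.to(device)
    optimiser = torch.optim.Adam(model.parameters(), lr=2e-4)
    criterion = torch.nn.CrossEntropyLoss()
    best_loss, best_state, stale = float("inf"), None, 0

    for epoch in range(100):                      # abort after 100 epochs
        model.train()
        for patches, labels in train_loader:
            optimiser.zero_grad()
            loss = criterion(model(patches.to(device)), labels.to(device))
            loss.backward()
            optimiser.step()

        model.eval()                               # mean validation loss
        with torch.no_grad():
            val_loss = sum(criterion(model(p.to(device)), y.to(device)).item()
                           for p, y in val_loader) / len(val_loader)

        if val_loss < best_loss:                   # still improving: keep going
            best_loss, stale = val_loss, 0
            best_state = copy.deepcopy(model.state_dict())
        else:
            stale += 1
            if stale >= patience:                  # no longer improving: stop
                break

    model.load_state_dict(best_state)
    return model
```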

Datasets. Two datasets are employed for quantitative evaluation. Prostate: this dataset consists of 4 tissue classes: benign, lumen, stroma, and tumour. Image patches are extracted from 28 whole slide images at 4 resolutions (\(2.5\times\), \(5\times\), \(10\times\), and \(20\times\)), so there are 4 images at each image location. No patches from the same whole slide image appear in more than one of the training, validation, and test partitions. All annotations were provided by an expert prostate pathologist. In total, there are 41,442 patches at each scale before augmentation (lumen 8,361, stroma 14,547, benign 12,016, and tumour 6,518). Breast: this publicly available dataset [6] consists of 4 tissue classes, namely normal, benign, in situ, and invasive. There are approximately 100 images in each class. Training and test partitions are provided by the authors. Here, we extracted patches at 4 resolutions: \(1.25\times\), \(2.5\times\), \(5\times\), and \(10\times\). For each resolution, there are 27,060 patches before augmentation (normal 6,616, benign 7,050, in situ 8,239, and invasive 5,155).
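
A sketch of how co-centred patches at several magnifications could be extracted with OpenSlide is given below. The mapping from magnification to pyramid level varies between scanners and slides, so the `levels` argument here is hypothetical and must be checked per slide; the extraction pipeline actually used is not described in the text.

```python
import numpy as np
from openslide import OpenSlide

def multiscale_patches(slide_path, centre_xy, size=64, levels=(0, 1, 2, 3)):
    """Return an (n_scales, size, size, 3) uint8 stack centred on one location.

    `levels` are pyramid levels assumed to correspond to e.g. 20x, 10x, 5x,
    and 2.5x; the true correspondence depends on the scanner.
    """
    slide = OpenSlide(slide_path)
    cx, cy = centre_xy                       # level-0 (full-res) coordinates
    patches = []
    for level in levels:
        down = slide.level_downsamples[level]
        # top-left corner in level-0 coordinates, so the patch stays centred
        x = int(cx - (size / 2) * down)
        y = int(cy - (size / 2) * down)
        region = slide.read_region((x, y), level, (size, size)).convert("RGB")
        patches.append(np.asarray(region))
    return np.stack(patches)
```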

Performance Evaluation. Since the segmentation problem is treated as patch-based classification in this study, we use the F1-measure for performance evaluation. The F1-measure is mathematically equivalent to the Dice index, a standard measure of segmentation accuracy. Due to the stochastic nature of the training process, we trained each approach 3 times and, in the evaluation, use the average values of true positives, false positives, and false negatives across the 3 runs.
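
Concretely, with per-class true positives (TP), false positives (FP), and false negatives (FN) averaged over the runs, the F1-measure is \(2\,\mathrm{TP} / (2\,\mathrm{TP} + \mathrm{FP} + \mathrm{FN})\). A minimal sketch:

```python
import numpy as np

def f1_from_runs(tp_runs, fp_runs, fn_runs):
    """Each argument is an array of shape (n_runs, n_classes).

    Counts are averaged across runs before the F1-measure (equivalently,
    the Dice index) is computed per class.
    """
    tp = np.mean(tp_runs, axis=0)
    fp = np.mean(fp_runs, axis=0)
    fn = np.mean(fn_runs, axis=0)
    return 2 * tp / (2 * tp + fp + fn)
```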

Table 1. Classification accuracy as measured by the F1-measure. Bold indicates the best performance. Green, blue, yellow, and red colour codings indicate that the results are within 97.5%, 95%, 90%, and 85% of the best performance, respectively. This colour coding scheme can be used to rank the methods (bold = 1, green = 2, blue = 3, yellow = 4, red = 5, and no colour = 6). The overall ranking is summarised by the rank sum. The total running time is measured on the test set of the prostate cancer data.

4 Results and Discussion

The results summarised in Table 1 provide a clear indication that including information from multiple scales (E, F, G, I, and J) improves segmentation performance. When ranked with respect to performance, approaches that operate on a fixed resolution are clearly inferior. Rather than simply reporting the top performance for each tissue category, we would like to highlight approaches that perform consistently well. A colour code is used in Table 1 to mark how each method relates to the top performer. In addition, we compute a cumulative rank.

While model H yields the top performance for selected classes, it also performs rather poorly on others. Given that this model performs extremely well at detecting stroma in prostate tissue, one could argue that it specialises in capturing certain texture patterns. Comparing models G and J yields some interesting observations. On the given datasets, model G performs consistently well in all of the tissue classes and has the lowest cumulative rank. Considering only the prostate samples, model J is clearly the best. However, the performance of this model degrades on the breast cancer cases. Here, the interplay between model complexity and the size of the dataset needs to be taken into account; we discuss this issue in more detail below. Overall, these results support our hypothesis that visual context and scale matter in histology image classification problems.

Fig. 3. Resilience to noise. The percentage change in the F1-measure at different noise levels is shown for model E (red) and model G (green). The performance at the zero noise level is used as a reference. At each scale, an image is randomly replaced by a noisy image (\(\sim \mathcal {N}(\cdot \mid \mu =127, \sigma ^2 = 1)\)) with probability \(p \in \{0.1, 0.3, 0.5\}\).

Dataset Size and Makeup. It is crucial to mention that we observe a high degree of visual variation within each class in the breast cancer data, yet each of these categories contains only a limited number of instances. This has two major consequences. First, compared to the prostate experiments, all of the methods perform worse. More importantly, the breast cancer dataset disadvantages more complex architectures. For example, consider method I and its counterpart with a significantly smaller number of parameters, method F. There is a dramatic drop in the performance of method I relative to that of method F in most of the tissue classes. The same behaviour can be observed for the pair J and G, and for the pair H and D, in some of the tissue classes. This is why the results obtained on this breast dataset need to be interpreted with great caution.

Feature Integration. As Table 1 shows, the models which utilise an LSTM unit (G and J) perform better than their counterparts without an LSTM (F and I) in most cases. Importantly, the LSTM unit also improves resilience to noise. The direct comparison between models E and G in Fig. 3 reports the percentage change in the F1-measure when the images are contaminated with noise. As one would expect, G is more resilient to noise, as its percentage reduction in performance is consistently smaller than that of strategy E across tissue classes in both datasets.
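
The corruption step of this experiment could look as follows: at each scale, the patch is independently replaced by pure Gaussian noise with mean 127 and unit variance, with probability \(p\), as described in the caption of Fig. 3. This is a sketch; the exact implementation is not given in the text.

```python
import numpy as np

def corrupt_scales(patches, p, rng=None):
    """patches: (n_scales, H, W, 3) uint8 stack of co-centred patches."""
    if rng is None:
        rng = np.random.default_rng()
    out = patches.copy()
    for s in range(out.shape[0]):
        if rng.random() < p:                 # replace this scale entirely
            noise = rng.normal(127, 1.0, size=out[s].shape)
            out[s] = np.clip(noise, 0, 255).astype(np.uint8)
    return out
```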

Computation Efficiency vs Accuracy. Especially when working on whole slide images, computational efficiency needs to be taken into account; memory usage and running time are the important factors here. In terms of the number of parameters, methods E, F, and G have significantly fewer parameters than H, I, and J. Based on the trend observed in the prostate dataset, it is possible that methods I and J would yield even better performance when trained on more samples and for longer. On the other hand, methods E, F, G, I, and J run significantly faster than H (Table 1). In the medical context, the cost of running time carries more weight than the cost of memory usage (number of parameters). As such, method H, which operates on the highest image resolution and full image dimensions, is very costly without offering a significant improvement in performance.
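
A small sketch, assuming PyTorch models, of the two efficiency measures discussed above: trainable parameter count and test-time running time.

```python
import time
import torch

def count_parameters(model):
    """Trainable parameter count, used to compare model complexity."""
    return sum(p.numel() for p in model.parameters() if p.requires_grad)

@torch.no_grad()
def timed_inference(model, loader, device="cuda"):
    """Wall-clock time to run the model over a whole test loader."""
    model.eval().to(device)
    start = time.perf_counter()
    for patches, _ in loader:
        model(patches.to(device))
    if device == "cuda":
        torch.cuda.synchronize()             # wait for queued GPU work
    return time.perf_counter() - start
```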

Table 2. Effect of the order of image scales. F1-measures of the single-stream CNN and LSTM method (G) subjected to different orderings of the image scales.
Fig. 4. Segmentation example. This whole slide segmentation was obtained with architecture G on one prostate cancer sample. Benign, tumour, lumen, and stroma regions are highlighted in orange, yellow, green, and purple, respectively. Note that the white background region is intentionally highlighted in green. Panels B and C correspond to the areas marked by rectangles in panels A and B, respectively.

Sensitivity of Method G to the Order of Image Scales. To inspect whether the order of the image scales affects performance, we considered the following scale orderings: (1) low to high, (2) high to low, (3) random (\(5\times \rightarrow 2.5\times \rightarrow 20\times \rightarrow 10\times\) for the prostate and \(2.5\times \rightarrow 1.25\times \rightarrow 10\times \rightarrow 5\times\) for the breast dataset), and (4) bidirectional (low \(\leftrightarrow\) high). As Table 2 shows, there is no strong difference between the F1-measures produced by the different conditions, which implies that the performance of method G does not depend on the order of the image scales.
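
Operationally, each ordering amounts to permuting the scale dimension of the input stack before the sequence enters the LSTM, as in the sketch below. The patch stack is assumed to be stored low to high (for the prostate data: \(2.5\times\), \(5\times\), \(10\times\), \(20\times\)); the bidirectional condition (4) would additionally require a bidirectional LSTM and is omitted here.

```python
import torch

ORDERS = {
    "low_to_high": [0, 1, 2, 3],
    "high_to_low": [3, 2, 1, 0],
    "random":      [1, 0, 3, 2],   # 5x -> 2.5x -> 20x -> 10x (prostate)
}

@torch.no_grad()
def predict_with_order(model, patches, order):
    """patches: (B, n_scales, 3, 64, 64); reorder the scales, then classify."""
    idx = torch.tensor(ORDERS[order], device=patches.device)
    return model(patches.index_select(1, idx)).argmax(dim=1)
```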

5 Conclusions

To address the lack of comprehensive annotations, we cast the segmentation problem as a patch-based classification task rather than a semantic segmentation task. In summary, we conclude:

  • Visual context: Our results support the claim that incorporating larger context produces superior results.

  • Feature integration: LSTM units effectively capture the dependencies between different scales and generally improve performance. LSTMs are resilient to noise and insensitive to the order of the inputs.

  • Dataset design: Small datasets typically do not represent the true variation of the data. Real clinical samples should be used for validation.

In addition, we have introduced a computationally efficient model (G) which performs well across different tissue categories. Visual inspection of this approach's segmentation results on whole slide images is also highly encouraging (Fig. 4). To overcome the problem of insufficient training data, we aim to establish a standard dataset which includes data and annotations from multiple institutions. In addition to manual annotation, immunohistochemistry staining will be considered to provide biological ground truth.