A Deep Learning-based Compression and Classification Technique for Whole Slide Histopathology Images

This paper presents an autoencoder-based neural network architecture to compress histopathological images while retaining the denser and more meaningful representation of the original images. Current research into improving compression algorithms is focused on methods allowing lower compression rates for Regions of Interest (ROI-based approaches). Neural networks are great at extracting meaningful semantic representations from images, therefore are able to select the regions to be considered of interest for the compression process. In this work, we focus on the compression of whole slide histopathology images. The objective is to build an ensemble of neural networks that enables a compressive autoencoder in a supervised fashion to retain a denser and more meaningful representation of the input histology images. Our proposed system is a simple and novel method to supervise compressive neural networks. We test the compressed images using transfer learning-based classifiers and show that they provide promising accuracy and classification performance.


Introduction
Computers and digital technologies are taking over the world of medicine.This is especially apparent in the field of histopathology which focuses on the microscopic analysis of tissue samples to provide diagnostic insights.Numerous healthcare providers, companies, and startups focus on delivering diagnostic services using stateof-the-art AI solutions [1].But these technologies are notoriously data-hungry.The transmission and storage of high-resolution histopathology images are laborious and quite costly.The need to improve the current image compression algorithms that would enable these images to be compressed to satisfying levels without impairing their diagnostic value is ever-increasing [2].
It is no longer necessary to be physically near the microscope or the site at which the tissue sample is taken to analyse it and provide a diagnosis for the patient.Moreover, these digital Whole Slide Images (WSIs) can also be transferred to a remote location through the internet for further requirements.The evolution of digital pathology is firmly underpinned by the development and improvement of Internet technology and cloud architectures in healthcare.Having sufficient broadband connections to transmit WSI images for telepathology is just scratching the surface [3].The objective of the work is to develop a simple technique to supervise and improve the performance of a compressive autoencoder, ensuring that most of the diagnosis-relevant data in whole slide histopathology images are retained in the compressed latent representation.It is worth noting that the main focus is on validating the theoretic feasibility of the proposed alternative techniques, rather than achieving similar or better results to the existing NIC and ROI-based compression approaches [4].
Our research intends to develop and validate a simple but novel method for directing the attention of a special kind of neural network that can compress image data, called an autoencoder, to retain more of the meaningful information present in the input image.In the context of histopathological images, we target the compressed representation should contain more of the important features necessary for correct diagnosis when the autoencoder is supervised by a trained classifier network that is placed on top of it.The proposed supervision method does indeed promote beneficial changes in the autoencoderr's encoding process, which in turn improves the accuracy of the label predictions made by the classifier.

Background and literature review
This section presents the background and literature review on image compression.When the compression rate of the lossless algorithm is not sufficient, it is necessary to allow for some information to be lost in the encoding process.Since the human eye can deal with a certain level of distortion or corruption in images, by removing this psycho-visual redundancy via quantization, much higher compression rates can be achieved.Hussain et al. [2] provide a good overview of the existing lossy compression techniques in their article.JPEG 2000 is the most widespread compression standard that allows for both lossless and lossy compression.A notable characteristic of JPEG 2000, and a major improvement over its predecessor algorithm (JPEG) that is worth mentioning here, is that it allows for user-controlled parameterization which enables different regions of images to be compressed at different rates [5].Other commonly used evaluation metrics are Peak Signal-to-Noise Ratio (PSNR) and structural similarity (SSIM).Unfortunately, none of these metrics truly align with human evaluation of perceptual similarity [6].
In histopathology and medical imaging, the general standard is lossless compression, and it is preferred over lossy compression techniques so as not to jeopardise diagnostic performance [7].However, several studies have investigated the effects of lossy compression methods in healthcare and found that -depending on modality.More recently, Chen [4] in the field of computer assisted diagnostics, and studied the effects of JPEG2000 compression on the performance of deep learning models in detecting metastatic cancer cells in histopathological images.In the case of WSIs, a large percentage (around 80%) of the slide area can be diagnostically irrelevant background or empty space, but compression artefacts in regions where tissue is present is highly undesirable [8].
CNNs can learn to abstract semantic representations of the content in images, and they are the de facto tool of choice in computer vision problems.When trained in a supervised fashion, they can learn what information is important for the task at hand and what can be dismissed [9].Deep learning techniques have already been used in JPEG artefact removal with great results, replacing handcrafted model-based approaches [10].In terms of using CNNs in the compression process itself, utilized CNN to flag regions of interest, which were in turn compressed with higher bit rate using a JPEG2000 encoder [11].In recent studies, neural networks are taking over the compression process entirely [12], Li and Ji [13] provide a great overview of the key concepts in neural image compression (NIC).In their article they make a keen observation about the general design and composition of NIC architectures; that given the successive encoding-decoding steps in compression tasks, they are all variations of a simple autoencoder design.

Proposed Methodology
In this section, we present the design of the autoencoder using supervised learning for the image compression.Autoencoder are neural networks comprised of an encoder bottom and a decoder top, and a bottleneck between them [14].They are usually symmetric in that the decoder mirrors the encoder part in its general structure, but this is not a necessary requirement.The key idea is that the output of an autoencoder is a reconstruction of the input and the network is optimized for minimizing the difference between them, i.e. the reconstruction error.
Assuming unlabelled training inputs x 1 , x 2 , ..., x n and we can use it as in Eq. ( 1).The autoencoder inputs x ∈ [0, 1] d for encoding with the hidden representation y ∈ [0, 1] d using mapping technique with a nonlinear function f as defined in Eq. ( 2).
The basic structure of autoencoder for an image is shown in Figure 2. Since the bottleneck has fewer dimensions compared to the input and the output, the encoder produces a compressed latent representation of the original image which makes this architecture a choice for compression problems.This is in contrast with most neural networks that often expand on dimensionality when learning abstract semantic representations to solve supervised problems.
For image data, convolutional autoencoders are the most common architecture of choice, and the loss functions are usually MSE, root MSE (RMSE) or variations of the other similarity metrics discussed.Irrespective of the specific architecture design or particular loss function/distortion metric used, autoencoders will learn how to encode the input in a compressed latent representation and how to reconstruct the original image from that code by minimizing the difference between the input and output image.In case of convolutional autoencoder the weights are distributed among all the inputs and the reconstruction image y can be defined using Eq. ( 4), where c is the input bias per channel, G is the set of latent features, and W is flip operation computed between two weight dimensions.

Supervised autoencoders
The trouble with compressing images with autoencoders is that the process is unsupervised, although some times autoencoders are referred to as semi or self supervised models [15].A deep autoencoder optimizing on a MSE loss function, for example, is ultimately solving a regression problem (albeit in a non-linear fashion).And as such it is an unsupervised process, regardless of what semantic representation may or may not be present in the latent space.When it comes to compressing WSIs this way, the features retained by the autoencoder might be the best ones to minimize the average difference between the original and reconstructed images, but they might not contain the information needed for histopathological diagnosis, which calls for the need to be able to supervise autoencoders to make sure that they retain the necessary information?Le and White [16] devised a Supervised Autoencoder (SAE) by adding a supervised loss (from label prediction) on the latent representation layer.By doing so, they managed to direct their autoencoder in representation learning towards features that were more important for the classification task.
In SAE, the supervised loss is included with the presentation layer, while in case of single hidden layer network it is added with the output layer.Assuming k is the size of a single hidden layer, F ∈ R d * k represents the weights of the first layer, W p ∈ R k * m stands for weights of output layer to predict y, weights W r ∈ R k * d to reconstruct x, L p as supervised loss and L r as reconstruction error.The objective is defined in Eq. ( 5) [16] as: The major challenge in image compression is information loss in the designed models.Researchers have proposed various loss functions depending upon the requirements and models.Let's assume supervised loss as L s with it related weight W s , L r as reconstruction error and its weight as W r .We define the loss function L(F ) in Eq. ( 6) as: Ultimately, the question if a reconstructed WSI after compression contains the necessary information or not boils down to whether the correct diagnosis can be made.However, visual evaluation of reconstructed images by actual histopathologists to assess the performance of a compressive autoencoder, especially during the development process, is simply unfeasible.More importantly, it isn't really necessary after all.Consider a sample image (Figure 1) from the Kaggle website, it describes: "A positive label indicates that the center 32x32px region of a patch contains at least one pixel of tumor tissue.Tumor tissue in the outer region of the patch does not influence the label.This outer region is provided to enable fully-convolutional models that do not use zero-padding, to ensure consistent behaviour when applied to a whole-slide image." As it has been noted, optimizing an autoencoder on the diagnostic performance and/or subjective evaluation of actual histopathologists is not a viable method.However, utilizing the next best Fig. 3 Presentation of the proposed model thing, a well-trained classifier network to do the same is a rather reasonable approach that provides the basis for the chosen methodology.Our proposed model is presented in Figure 3.The objective was to build an ensemble of neural networks that enables a compressive autoencoder in a supervised fashion to retain a denser and meaningful representation of the input histology images.
The key steps of the proposed method were as follows: 1. Build and train a simple autoencoder to compress histopathological images.

Train a classifier on the original, uncompressed
images to correctly identify label/diagnosis.3. Build an ensemble with the trained autoencoder base and trained classifier top.4. Keep the classifier top untrainable and optimize the ensemble to correctly classify the reconstructed images.5. Since it is the only part that has trainable weights, the autoencoder should learn prioritize features important for classification (that the classifier top is looking for).
One key advantage of using an autoencoder to compress images is that it can be broken up after being trained, and the encoding and decoding could be done independently and on separate devices.Which means that it is only the compressed representation of the original image that needs to be transmitted between them, thus reducing network traffic.
For developing the classifier, the most promising and reasonable approach was to use transfer learning, i.e. taking a publicly available pretrained model and adapt it to the task at hand.This method seemed more efficient and reliable than designing and training a network from scratch.

Compressive autoencoder
Several autoencoder architectures were built and tested with different compression ratios: reducing the dimensionality to 33% (6x6x256), 66% (12x12x128) and 83% (12x12x160) of that of the original input images (96x96x3).As opposed to common practice in autoencoder design, the output of the encoder, i.e. the compressed code, was not flattened, as it was regarded as an unnecessary step.Although, it certainly could be done in the future if needed.The encoder and decoder parts were symmetric in all autoencoder.
They all followed a similar structure of having 3-4 blocks containing: (3-by-3) convolutional layers with 'same' padding followed by batch normalization and ReLU activation function.In the encoder this was followed by (2-by-2) max pooling layers whereas in the decoder they were followed by (2-by-2) up-sampling instead.The autoencoder were trained for a 100 epochs, optimizing on MSE loss function as the measure of difference between the input and output images.By virtue of having such small models, the training never took longer than 15 minutes.

Classifier
The method of choice to develop the classifier, as previously discussed, was transfer learning.Several models available from the Keras API were experimented with; e.g.EfficientNetB2, ResNet50, VGG16 and Xception that were already pretrained on the ImageNet data set.The approach was used to adapt these models for binary classification by loading base model without the top layers and then adding a global average pooling layer and a dense layer with a single node using sigmoid activation function.But, all of the layers in the base models were set as non-trainable and as gradually training progressed, more and more layers were set trainable allowing the model to adapt to the specific computer vision and image classification problem which also referred as fine-tuning in transfer learning.However, validation loss is carefully monitored to avoid overfitting of the models during the training.Check-points/callbacks are often used for the models and saved in states during validation process to improve performance.
Data augmentation is also deployed when required for rotating to flip the images either vertically or horizontally.The optimizer in all training was Adam, with a learning rate between Fig. 6 Classifier training with different performance measure 0.01-0.0001as per situation requirement.The loss functions in all cases are binary cross-entropy.The classifier with the Xception base is selected to" supervise" the autoencoder in the ensemble with 0.95 accuracy and AUC-ROC score on the validation (uncompressed) data set.To illustrate the training history and performance of this model, graphs are shown in Figure 6.
In Figure 6, the classifier is trained for 300 epochs in total and for the first 80 epochs only the classifier top is trainable.Then, last/top convolutional block of the exception base model is set trainable for the next 120 epochs during which the accuracy on the validation and training images started to diverge.For 176 000 training images and a batch size of 64 after 200th epoch data augmentation is implemented (rotation, horizontal/vertical flipping) to avoid overfitting along with training of weight adjustment.

Ensemble
After a suitable compressive autoencoder and a classifier which correctly predict the target labels, or rather accurately enough, the procedure follows as: • Initially, set all the weights in the classifier untrainable.• Add untrainable weight on top of the trained autoencoder.• Train the whole assemble to accurately classify images.
Before the training of Xception classifier the optimizer and loss function are the same.Since, the weights of the classifier top cannot be trained thus the features used to predict the labels cannot be changed either or too.Moreover, the model is a kind of model which "knows what it's looking for".The AE will learn to priorities features that the classifier needs to make accurate diagnosis and   The classifier performs rather clumsily on the images which compressed and reconstructed by the unsupervised autoencoder, i.e. the simple compressive AE that has not been trained in the ensemble; with AUC-ROC scored just above 0.5 and that is close to the effectiveness of random guessing.In fact, it's even worse than simply  predicting the more prevalent negative diagnosis (approximately 60% of the images have 0/negative labels) which signifies the unsupervised autoencoder is not very good at image compression.
The resultant images of Figure 9 indicate that not only is the supervising ensemble capable of influencing the autoencoder towards the features or regions of the images to retain in the latent representation, but that these features are actually influencing the reconstructive images.It certainly enables the classifier to overall improve more to accurate label predictions.In the Figure 10 the top row contains samples of the original, uncompressed images that are annotated with their true and predicted labels.The second or middle row shows the same images but after being compressed and reconstructed by the unsupervised AE, their labels predicted by the classifier.The bottom row contains the images (tagged with 'Reconstructed FT' title) from the supervised autoencoder.
To illustrate and assess the qualitative differences between the compression process and the reconstructed images with sample images have been provided in Figure 10 using autoencoder.This could possibly indicate that the supervised autoencoder is prioritizing the spatial qualities and structural elements of the images, which are more informative for the label predictions, rather than overall colour.Once the image reconstructed by the supervised AE, the checkerboard network artefacts seem to be more pronounced in the peripheral regions and around the edges of the images, which could actually indicate a positive adaptation to the task, since, according to the data set description as it is only the center regions that inform the image labels.Similarly, these artefacts seem to be more pronounced in the white/empty regions of the images reconstructed by the supervised AE, compared to the unsupervised ones.This further supports the previous observation of the supervised AE placing more attention on regions important for the classification.

Conclusions and future work
The main contributions of this work to the field of ROI-based neural compression is proof on concept; providing evidence for the viability of a new and simple supervision method for improving the performance of compressive neural networks.The exact specifications of the compression architecture may not be important; this method would possibly work for other compressive neural networks.Furthermore, if a diagnostic health-care provider already has a working, state-of-the-art classifier neural network, which they probably do, they now also have in their hands a quick and simple tool to improve upon their existing compression models and ultimately to decrease traffic in their IoT network.

Fig. 1
Fig. 1 Sample Images from the Data Set

Fig. 4
Fig. 4 Comparison of the original and reconstructed images with different compression rates.

1
Environment set upThe simulation is carried out in a Windows 10 operating system (with Intel i7 2.20GHz CPU, 8GB RAM and NVIDIA GeForce GTX 1050Ti GPU) in Python for Keras API and NVIDIA CUDA enabled TensorFlow-GPU as backend.The data set chosen is the PCam data set[17] curated for the Histopathologic Cancer Detection competition (Histopathologic Cancer Detection, 2018) and contained 220 025 labelled patches of lymph node sections for training purposes, as well as 57458 unlabeled patches for testing; in total close to 8 GB of data.

Figure 5
Figure 5 depicts the performance evaluation and ultimately the selection of autoencoder models based on visual comparison of their respective output.Other major factor in selection of the right autoencoder to be used in the ensemble is influenced by the compression rate.Higher compression rates allow for more pronounced effects under the supervised training in terms of what features should be retained in the compressed representation.Thus, the autoencoder that reduced the dimensionality of the original images to 66% was deemed as shown in figure 5.

Fig. 7
Fig. 7 Ensemble training with different performance measure

Figure 7
depicts the accuracy rate of the ensemble in the first epoch was quite low but gradually significant increment during the training, exceeding 0.9 by the 100th epoch.To avoid model state preservation when it has over-fitting during the training data, the evaluation loss and accuracy are closely monitored and checkpoints were created accordingly.The improvement in accuracy and AUC measures indicate the autoencoder is encoded so as to betterment the input images.5 ResultsIt is being said that, 20% percent of the training data is set aside for evaluation purposes during training and model development and in these cases accuracy and f1 scores are also utilized as represented in Figure??.

Figure 8
shows the scores calculated based on the predictions made by the task-adapted Xception classifier on different images.The left side of the figure shows the scores obtained on images from validation data set, and on the right from test data set.The magenta bars show the AUC-ROC score for the label predictions of the original, uncompressed images.The grey bars indicate the classifier's performance on the images reconstructed by the unsupervised autoencoder.The blue bars mark the classifier's performance on the images reconstructed by the supervised autoencoder that has been trained in the ensemble.

Fig. 9
Fig. 9 Sample images to illustrate the effects of compression with the unsupervised and supervised AEs

Fig. 10
Fig. 10 Examples of the images reconstructed by the selected autoencoder