1 Introduction

Computers and digital technologies are now especially prominent in histopathology, where tissue samples are analysed microscopically to provide diagnostic insights. Researchers have focused on delivering diagnostic services using state-of-the-art AI solutions [1,2,3], but these technologies are notoriously data-hungry, and the transmission and storage of high-resolution histopathology images are laborious and costly. There is therefore an ever-increasing need for improved image compression algorithms that can compress these images to satisfactory levels without impairing their diagnostic value [4]. It is no longer necessary to be physically near the microscope, or the site at which the tissue sample is taken, to analyse the sample and provide a diagnosis for the patient. Moreover, digital Whole Slide Images (WSIs) can be transferred to a remote location over the Internet for further use. Having sufficient broadband capacity to transmit WSIs for telepathology is just scratching the surface [5, 6]. The objective of this work is to develop a simple technique to supervise and improve the performance of a compressive autoencoder, ensuring that most of the diagnosis-relevant data in whole slide histopathology images is retained in the compressed latent representation. It is worth noting that the main focus is on validating the theoretical feasibility of the proposed alternative technique, rather than on matching or surpassing the results of existing ROI-based Neural Image Compression (NIC) approaches [7].

Nowadays, diagnostic services increasingly rely on a cloud-based medical ecosystem. This vastly interconnected, AI-assisted network is often referred to as Computational Pathology (CPATH). Deep learning models are highly resource-intensive, and the data generation, training, and deployment of such models usually take place on different devices, often at distant locations. Advances in computation and digital technologies are improving the quality and safety of recording and monitoring procedures, as well as speeding up diagnostic processes by aiding medical practitioners. Histopathology, in particular, has undergone major digitization in recent years. Scanning whole histopathology slides and storing them as high-resolution WSIs has opened the door to a wealth of opportunities for pathologists [5].

Our research intends to develop and validate a simple but novel method for directing the attention of an autoencoder, a special kind of neural network that can compress image data, so that it retains more of the meaningful information present in the input image. In the context of histopathological images, the aim is for the compressed representation to contain more of the features necessary for a correct diagnosis when the autoencoder is supervised by a trained classifier network placed on top of it. The proposed supervision method does indeed promote beneficial changes in the autoencoder's encoding process, which in turn improves the accuracy of the label predictions made by the classifier.

2 Background and literature review

This section presents the background and literature review on image compression.

Several studies have investigated the effects of lossy compression methods in healthcare and found that the outcome depends on the imaging modality. More recently, Chen [7], working in the field of computer-assisted diagnostics, studied the effects of JPEG 2000 compression on the performance of deep learning models in detecting metastatic cancer cells in histopathological images. To secure personal health information from unauthorized access, blockchain might be considered [8].

Deep learning techniques have already been used in JPEG artefact removal with great results, replacing handcrafted model-based approaches [9]. In recent studies, neural networks are taking over the compression process entirely [10, 11]. Li and Ji [12] provide a comprehensive overview of the key concepts in NIC. The authors focus on the general design and composition of NIC architectures and note that, given the successive encoding-decoding steps in compression tasks, they are all variations of a simple autoencoder design (Table 1).

Table 1 Summary of literature review

Numerous autoencoder variants have been developed over the years for a broad range of applications and have been successfully deployed in medical image noise removal tasks [18]. In these applications, skip-connections are usually added between the encoder and decoder to bypass the bottleneck and improve the output image quality, an information leakage that would be problematic in a compression task. This structure strikingly resembles the U-Net architectures used in medical image segmentation [19]. Autoencoders are often used to pre-train U-Net architectures, which shortens their training time compared to randomly initializing the U-Net weights [20]. Bedi and Gole [21] constructed an efficient hybrid Convolutional Neural Network (CNN) with the pre-trained encoder bottom of a Convolutional Autoencoder (CAE) as the base. Because the encoder compresses the input image into a latent space representation, this greatly reduced the number of trainable parameters in the classifier top, as well as in the overall assembled network. One autoencoder variant that needs to be mentioned here is the Variational Autoencoder (VAE), which is an autoencoder in name only, reflecting its similar encoder-decoder structure. The primary purpose of VAEs is not dimensionality reduction, although they are often applied to image compression problems [22], but to learn the parameters of a probability distribution over a regularized latent space. The reconstructed output is built by decoding points sampled from this latent distribution, which enables the decoder to generate realistic content in high volume, as shown in Fig. 1.

Fig. 1 Sample images from the data set

3 Proposed methodology

In this section, we present the design of an autoencoder that uses supervised learning for image compression. We describe the complete design methodology of the proposed autoencoder in detail, along with examples.

Autoencoders are neural networks composed of an encoder bottom, a decoder top, and a bottleneck between them [23]. They are usually symmetric in that the decoder mirrors the general structure of the encoder, but this is not a requirement. The key idea is that the output of an autoencoder is a reconstruction of the input, and the network is optimized to minimize the difference between them, i.e. the reconstruction error.

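Assume unlabelled training inputs \(x^{1},~ x^{2},~...,x^{n}\), each of which also serves as its own target, as in Eq. (1). The autoencoder maps an input \(x \in [0,1]^d\) to the hidden representation \(y \in [0,1]^{d'}\) (with \(d' < d\) for a compressive bottleneck) using a nonlinear function f, as defined in Eq. (2). The reconstruction p is then obtained from y as in Eq. (3), where W and W' are weight matrices and k and k' are bias vectors.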

$$\begin{aligned} z^{i}= & {} x^{i}\end{aligned}$$
(1)
$$\begin{aligned} y= & {} f(Wx + k)\end{aligned}$$
(2)
$$\begin{aligned} p= & {} f(W'y + k') \end{aligned}$$
(3)

The basic structure of an autoencoder for an image is shown in Fig. 2. Since the bottleneck has fewer dimensions than the input and the output, the encoder produces a compressed latent representation of the original image, which makes this architecture a natural choice for compression problems. This is in contrast with most neural networks, which often expand the dimensionality when learning abstract semantic representations to solve supervised problems.
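As a rough illustration of Eqs. (2) and (3), the sketch below encodes a four-dimensional input into a two-dimensional code and decodes it again. The sigmoid choice for f, the tied decoder weights (W' set to the transpose of W) and all numeric values are assumptions made for this example only; they are not taken from the proposed model.

```python
import math

def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))

def affine(W, x, k):
    # Computes W x + k for a matrix W (list of rows), a vector x and a bias k.
    return [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(W, k)]

def encode(W, k, x):
    # Eq. (2): y = f(W x + k), with f assumed to be the sigmoid.
    return [sigmoid(v) for v in affine(W, x, k)]

def decode(W_prime, k_prime, y):
    # Eq. (3): p = f(W' y + k').
    return [sigmoid(v) for v in affine(W_prime, y, k_prime)]

def mse(a, b):
    # Mean squared reconstruction error between two vectors.
    return sum((u - v) ** 2 for u, v in zip(a, b)) / len(a)

x = [0.2, 0.9, 0.4, 0.7]                   # input with d = 4 (made-up values)
W = [[0.5, -0.3, 0.8, 0.1],
     [-0.6, 0.4, 0.2, 0.9]]                # 2 x 4 encoder weights (made-up values)
k = [0.0, 0.1]
W_prime = [list(col) for col in zip(*W)]   # assumption: tied weights, W' = W^T
k_prime = [0.0, 0.0, 0.0, 0.0]

y = encode(W, k, x)                        # compressed code with d' = 2
p = decode(W_prime, k_prime, y)            # reconstruction with d = 4
error = mse(x, p)
print(len(y), len(p), error)
```

Training would adjust W, k, W' and k' to reduce `error`; the sketch only shows a single forward pass.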

Fig. 2 Autoencoder structure

For image data, convolutional autoencoders are the most common architecture of choice, and the loss functions are usually MSE, root MSE (RMSE) or variations of the other similarity metrics discussed. Irrespective of the specific architecture design or the particular loss function/distortion metric used, autoencoders learn how to encode the input in a compressed latent representation and how to reconstruct the original image from that code by minimizing the difference between the input and output images. In a convolutional autoencoder the weights are shared across all input locations, and the reconstructed image y can be defined using Eq. (4), where c is the bias per input channel, G is the set of latent feature maps \(g^{i}\), and \(W^{i}\) denotes the weights after a flip operation over both of their dimensions.

$$\begin{aligned} y = f(\sigma (g^{i}*W^{i} + c)) \end{aligned}$$
(4)

The trouble with compressing images with autoencoders is that the process is unsupervised, although autoencoders are sometimes referred to as semi- or self-supervised models [24]. A deep autoencoder optimizes an MSE loss function, and as such its training is an unsupervised process, regardless of what semantic representation may or may not be present in the latent space. When WSIs are compressed this way, the features retained by the autoencoder might be the best ones for minimizing the average difference between the original and reconstructed images, but they might not contain the information needed for histopathological diagnosis. This calls for a way to supervise autoencoders to make sure that they retain the necessary information. Le, Patterson and White [25] devised a Supervised Autoencoder (SAE) by adding a supervised loss (from label prediction) on the latent representation layer. By doing so, they managed to direct the representation learning of their autoencoder towards features that were more important for the classification task.

In an SAE, the supervised loss is added to the representation layer; for a network with a single hidden layer, this means adding it to the output layer. Assuming k is the size of the single hidden layer, \(F \in R^{k \times d}\) represents the weights of the first layer, \(W_{p} \in R^{m \times k}\) the weights of the output layer that predicts y, \(W_{r} \in R^{d \times k}\) the weights that reconstruct x, \(L_{p}\) the supervised loss and \(L_{r}\) the reconstruction error. The objective is defined in Eq. (5) [25] as:

$$\begin{aligned} \begin{aligned}&\frac{1}{t} \sum ^{t}_{i=1}\left[ L_{p}(W_{p}Fx_{i},y_{i})+L_{r}(W_{r}Fx_{i},x_{i})\right] \\&\qquad = \frac{1}{2t} \sum ^{t}_{i=1}\left[ \Vert W_{p}Fx_{i}-y_{i} \Vert ^{2}_{2}+ \Vert W_{r}Fx_{i}-x_{i} \Vert ^{2}_{2}\right] \end{aligned} \end{aligned}$$
(5)

The major challenge in image compression is the information loss in the designed models. Researchers have proposed various loss functions depending on the requirements and models. Let \(L_{s}\) denote the supervised loss with its associated weights \(W_{s}\), and \(L_{r}\) the reconstruction error with its weights \(W_{r}\), with \(y_{p,i}\) and \(y_{r,i}\) denoting the prediction and reconstruction targets of sample i. We define the loss function L(F) in Eq. (6) as:

$$\begin{aligned} L(F)= \frac{1}{t} \sum ^{t}_{i=1} \left[ L_{s}(W_{s}Fx_{i},y_{p,i})+L_{r}(W_{r}Fx_{i},y_{r,i}) \right] \end{aligned}$$
(6)
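The combined objective of Eq. (6) can be sketched in a few lines of plain Python. The sketch assumes a linear single-hidden-layer network, squared error for both \(L_{s}\) and \(L_{r}\), and the input itself as the reconstruction target; the matrices and samples are made-up values chosen so that the result is easy to check by hand.

```python
def matvec(M, v):
    # Matrix-vector product for a matrix given as a list of rows.
    return [sum(m * vi for m, vi in zip(row, v)) for row in M]

def squared_error(pred, target):
    return sum((a - b) ** 2 for a, b in zip(pred, target))

def supervised_ae_loss(F, W_s, W_r, xs, ys):
    # Eq. (6): average over samples of supervised loss plus reconstruction loss.
    total = 0.0
    for x, y_p in zip(xs, ys):
        h = matvec(F, x)                                 # hidden representation F x
        total += squared_error(matvec(W_s, h), y_p)      # L_s(W_s F x, y_p)
        total += squared_error(matvec(W_r, h), x)        # L_r(W_r F x, y_r), with y_r = x
    return total / len(xs)

# Hypothetical weights: d = 3 inputs, k = 2 hidden units, m = 1 output.
F = [[1.0, 0.0, 0.0],
     [0.0, 1.0, 0.0]]          # k x d
W_s = [[1.0, 1.0]]             # m x k
W_r = [[1.0, 0.0],
       [0.0, 1.0],
       [0.0, 0.0]]             # d x k

xs = [[1.0, 2.0, 3.0], [0.0, 1.0, 0.0]]
ys = [[3.0], [0.0]]

loss = supervised_ae_loss(F, W_s, W_r, xs, ys)
print(loss)
```

For the first sample the prediction is exact and the reconstruction misses only the third component (squared error 9); for the second the reconstruction is exact and the prediction is off by one (squared error 1), so the averaged loss is 5.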

Ultimately, the question of whether a reconstructed WSI contains the necessary information after compression boils down to whether the correct diagnosis can still be made. However, visual evaluation of reconstructed images by actual histopathologists to assess the performance of a compressive autoencoder, especially during the development process, is simply infeasible. More importantly, it is not really necessary. Consider the sample images (Fig. 1) from the data set, whose description on the Kaggle website states: “A positive label indicates that the center 32x32px region of a patch contains at least one pixel of tumor tissue. The outer region is provided to enable fully-convolutional models that do not use zero-padding, to ensure consistent behavior when applied to a whole-slide image.”

Fig. 3 Presentation of the proposed model

Optimizing an autoencoder on the diagnostic performance and/or subjective evaluation of actual histopathologists is not a viable method. Adding a well-trained classifier network instead is a reasonable approach, and it provides the basis for the chosen methodology. Our proposed model is presented in Fig. 3. The objective was to build an ensemble of neural networks in which a compressive autoencoder is trained in a supervised fashion to retain a denser and more meaningful representation of the input histology images.

One key advantage of using an autoencoder to compress images is that it can be broken up after being trained, and the encoding and decoding can be done independently and on separate devices. This means that it is only the compressed representation of the original image that needs to be transmitted between them, thus reducing network traffic. The type of autoencoder chosen for the ensemble was a simple compressive CNN.

Algorithm 1 Whole Slide Histopathology Image Classify_DL

Although variational autoencoders seem to be in vogue for their probabilistic modelling approach and generative capabilities, reproducibility and interpretability are paramount in medical diagnostics. Therefore, the more straightforward design of a simple autoencoder was deemed more appropriate. The network artefacts of simple convolutional autoencoders are well known, have a highly regular pattern, and can be minimized with careful network design or removed from the reconstructed images later if need be.

For developing the classifier, the most promising and reasonable approach was to use transfer learning, i.e. taking a publicly available pre-trained model and adapting it to the task at hand. This method seemed more efficient and reliable than designing and training a network from scratch. That being said, the resulting model did not need to achieve stunning results, but it had to perform reasonably well on the classification task to be able to supervise the autoencoder.

4 Simulation

The simulation was carried out on a Windows 10 machine (Intel i7 2.20 GHz CPU, 8 GB RAM and NVIDIA GeForce GTX 1050Ti GPU) in Python, using the Keras API with NVIDIA CUDA-enabled TensorFlow-GPU as the backend. The Keras API was also useful in creating and managing the training and validation subsets of the image data. Several autoencoder architectures were built and tested with different compression ratios, reducing the dimensionality to 33% (6 \(\times\) 6 \(\times\) 256), 66% (12 \(\times\) 12 \(\times\) 128) and 83% (12 \(\times\) 12 \(\times\) 160) of that of the original input images (96 \(\times\) 96 \(\times\) 3). Figure 4 compares the original and reconstructed images after the dimensionality reduction.

Fig. 4 Comparison of the original and reconstructed images with different compression rates

4.1 Data preparation

The data set chosen was the PCam data set [26], curated for the Histopathologic Cancer Detection competition (Histopathologic Cancer Detection, 2018). It contains 220,025 labelled patches of lymph node sections for training purposes, as well as 57,458 unlabelled patches for testing; in total close to 8 GB of data. The dimensions of the images are 96 \(\times\) 96 pixels. They have binary labels with a slight label imbalance in the training data: 59.5% of the images are labelled cancer cell-free and 40.5% of the images have a positive label.

4.2 Compressive autoencoder

Several autoencoder architectures were built and tested with different compression ratios, reducing the dimensionality to 33% (6\(\times\)6\(\times\)256), 66% (12\(\times\)12\(\times\)128) and 83% (12\(\times\)12\(\times\)160) of that of the original input images (96\(\times\)96\(\times\)3). As opposed to common practice in autoencoder design, the output of the encoder, i.e. the compressed code, was not flattened, as this was regarded as an unnecessary step, although it could certainly be done in the future if needed. The encoder and decoder parts were symmetric in all autoencoders. They all followed a similar structure of 3–4 blocks, each containing (3\(\times\)3) convolutional layers with 'same' padding followed by batch normalization and a ReLU activation function. In the encoder, each block was followed by a (2\(\times\)2) max pooling layer, whereas in the decoder it was followed by (2\(\times\)2) up-sampling instead. The autoencoders were trained for 100 epochs, optimizing an MSE loss function as the measure of difference between the input and output images. Because the models were so small, training never took longer than 15 min.
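The quoted percentages can be checked with a few lines of arithmetic. The sketch below traces how repeated (2\(\times\)2) max pooling shrinks a 96\(\times\)96 input (with 'same' padding, the convolutions leave the spatial size unchanged) and compares the number of values in each latent code with the number of values in the input. The block counts per configuration (four for the 6\(\times\)6 code, three for the 12\(\times\)12 codes) are inferred from the stated shapes and are an assumption, not a detail given explicitly in the text.

```python
def encoder_output_shape(height, width, n_blocks, out_channels):
    # Each block ends in 2x2 max pooling, which halves both spatial dimensions.
    for _ in range(n_blocks):
        height //= 2
        width //= 2
    return (height, width, out_channels)

def size(shape):
    h, w, c = shape
    return h * w * c

input_shape = (96, 96, 3)

# name -> (assumed number of encoder blocks, channels of the latent code)
configs = {"33%": (4, 256), "66%": (3, 128), "83%": (3, 160)}

ratios = {}
for name, (n_blocks, channels) in configs.items():
    latent = encoder_output_shape(96, 96, n_blocks, channels)
    ratios[name] = round(100 * size(latent) / size(input_shape), 1)
    print(name, latent, ratios[name])
```

Note that these ratios count tensor elements only; the actual storage saving also depends on the numeric precision used for the latent code.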

Fig. 5 Autoencoder summary

Figure 5 summarises the performance evaluation and, ultimately, the selection of the autoencoder models, which was based on a visual comparison of their respective outputs. Another major factor in selecting the autoencoder for the ensemble was the compression rate. Higher compression rates allow for more pronounced effects under supervised training in terms of which features are retained in the compressed representation. Thus, the autoencoder that reduced the dimensionality of the original images to 66% was deemed the most suitable, as shown in Fig. 5.

4.3 Classifier

Several models available from the Keras API were experimented with, e.g. EfficientNetB2, ResNet50, VGG16 and Xception, all of which were already pre-trained on the ImageNet data set. These models were adapted for binary classification by loading the base model without its top layers and then adding a global average pooling layer and a dense layer with a single node using a sigmoid activation function. Initially, all of the layers in the base models were set as non-trainable; as training progressed, more and more layers were gradually set as trainable, allowing the model to adapt to the image classification problem, which is also referred to as fine-tuning in transfer learning. The validation loss was carefully monitored to avoid overfitting the models during training. Checkpoint callbacks were used to save the model states that performed best during validation.

Data augmentation (rotation and vertical or horizontal flipping of the images) was also applied when required. The optimizer in all training runs was Adam, with a learning rate between 0.01 and 0.0001 as required. The loss function in all cases was binary cross-entropy. The classifier with the Xception base was selected to “supervise” the autoencoder in the ensemble, having achieved an accuracy and AUC-ROC score of 0.95 on the (uncompressed) validation data set. The training history and performance of this model are shown in Fig. 6.
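For reference, the binary cross-entropy loss named above can be written out directly. The following is a minimal standard-library sketch rather than the Keras implementation; the clipping constant `eps` and the example probabilities are illustrative assumptions.

```python
import math

def binary_cross_entropy(y_true, y_prob, eps=1e-7):
    # Mean of -[y log(p) + (1 - y) log(1 - p)] over all samples.
    total = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1.0 - eps)   # clip to avoid log(0)
        total -= y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return total / len(y_true)

labels = [1, 0]
confident_right = binary_cross_entropy(labels, [0.9, 0.1])
uncertain = binary_cross_entropy(labels, [0.5, 0.5])
confident_wrong = binary_cross_entropy(labels, [0.1, 0.9])
print(confident_right, uncertain, confident_wrong)
```

The loss grows from confident correct predictions, through uninformative ones (log 2 for a probability of 0.5), to confident wrong ones, which is what makes it a useful training signal for the sigmoid output node.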

Fig. 6 Classifier training with different performance measure

As shown in Fig. 6, the classifier was trained for 300 epochs in total, and for the first 80 epochs only the classifier top was trainable. The last (top) convolutional block of the Xception base model was set trainable for the next 120 epochs, during which the accuracy on the validation and training images started to diverge. With 176,000 training images and a batch size of 64, data augmentation (rotation, horizontal/vertical flipping) was introduced after the 200th epoch to avoid overfitting while the weights continued to be adjusted.
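The staged schedule described above, and the number of batches per epoch it implies, can be summarised as follows. The function name `training_stage` is hypothetical, and two points are assumptions: that the layers unfrozen in the second stage stay trainable after the 200th epoch, and that the figure of roughly 176,000 training images results from holding out 20% of the 220,025 labelled patches.

```python
import math

def training_stage(epoch):
    # Returns the (assumed) training configuration for a 1-based epoch number.
    if epoch <= 80:
        return {"trainable": "classifier top", "augmentation": False}
    if epoch <= 200:
        return {"trainable": "classifier top + last convolutional block",
                "augmentation": False}
    # Assumption: the same layers remain trainable once augmentation starts.
    return {"trainable": "classifier top + last convolutional block",
            "augmentation": True}

total_labelled = 220025
validation_images = total_labelled * 20 // 100        # 20% hold-out
training_images = total_labelled - validation_images
batch_size = 64
steps_per_epoch = math.ceil(training_images / batch_size)

print(training_images, steps_per_epoch)
for epoch in (1, 81, 201, 300):
    print(epoch, training_stage(epoch))
```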

4.4 Ensemble

Once a suitable compressive autoencoder and a classifier that predicts the target labels accurately enough are available, the procedure is as follows:

  • Initially, set all the weights in the classifier as untrainable.

  • Add the classifier, with its untrainable weights, on top of the trained autoencoder.

  • Train the whole assembly to accurately classify images.

The optimizer and loss function are the same as in the earlier training of the Xception classifier. Since the weights of the classifier top cannot be trained, the features used to predict the labels cannot change either. In other words, the classifier is a model that “knows what it's looking for”. The AE will learn to prioritise the features that the classifier needs to make accurate diagnoses and retain them in the compressed latent code, along the lines of an ROI-based compression method in which the regions of interest are dictated by the supervising classifier.
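The mechanism can be illustrated with a deliberately tiny stand-in. In the sketch below the "autoencoder" is a single trainable scalar a (so that the reconstruction is a·x), and the "classifier" is a fixed logistic unit whose parameters c and b are never updated. Everything here, including the data, the parameter values and the learning rate, is a hypothetical toy chosen for clarity; it is not the Xception ensemble described in the text.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def bce(y, p, eps=1e-12):
    p = min(max(p, eps), 1.0 - eps)
    return -(y * math.log(p) + (1 - y) * math.log(1.0 - p))

# Toy data: scalar "images" x with binary labels.
data = [(1.0, 1), (2.0, 1), (-1.0, 0), (-2.0, 0)]

# Frozen "classifier": p = sigmoid(c * x_hat + b); c and b are never updated.
c, b = 2.0, 0.0

# Trainable "autoencoder": x_hat = a * x, starting from a poor reconstruction.
a = 0.1

def mean_loss(a_value):
    return sum(bce(y, sigmoid(c * a_value * x + b)) for x, y in data) / len(data)

initial_a = a
initial_loss = mean_loss(a)
learning_rate = 0.5

for _ in range(50):
    # Gradient of the classification loss with respect to a only:
    # dL/da = mean((p - y) * c * x). No gradient step is taken for c or b.
    grad = sum((sigmoid(c * a * x + b) - y) * c * x for x, y in data) / len(data)
    a -= learning_rate * grad

final_loss = mean_loss(a)
print(initial_loss, final_loss, a)
```

Only the autoencoder parameter moves, yet the classification loss falls: the frozen classifier acts purely as a fixed measuring device that tells the autoencoder which aspects of its output matter.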

Figure 7 shows that the accuracy of the ensemble was quite low in the first epoch but increased significantly during training, exceeding 0.9 by the 100th epoch. To avoid preserving model states that over-fit the training data, the evaluation loss and accuracy were closely monitored and checkpoints were created accordingly. The improvement in the accuracy and AUC measures indicates that the autoencoder learned to encode the input images better.

Fig. 7 Ensemble training with different performance measure

5 Results

As noted, 20% of the training data was set aside for evaluation purposes during training and model development, and in these cases accuracy and F1 scores were also used. The scores depicted in Fig. 8 were calculated from the predictions made by the task-adapted Xception classifier on different images. The left side of the figure shows the scores obtained on images from the validation data set and the right side those from a test data set. The magenta bars show the AUC-ROC score for the label predictions on the original, uncompressed images. The grey bars indicate the classifier's performance on the images reconstructed by the unsupervised autoencoder. The blue bars mark the classifier's performance on the images reconstructed by the supervised autoencoder that has been trained in the ensemble.

Fig. 8 AUC-ROC scores of classifier performance

The classifier performs poorly on the images that were compressed and reconstructed by the unsupervised autoencoder, i.e. the simple compressive AE that has not been trained in the ensemble, with an AUC-ROC score just above 0.5, which is close to the effectiveness of random guessing. In fact, this is even worse than simply predicting the more prevalent negative diagnosis (approximately 60% of the images have 0/negative labels), which signifies that the unsupervised autoencoder is not very good at image compression for this task.
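To make the 0.5 reference point concrete, AUC-ROC can be computed as the fraction of positive-negative pairs in which the positive example receives the higher score, with ties counted as half. The sketch below uses this pairwise definition on made-up labels and scores; it is a plain-Python illustration, not the evaluation code used for Fig. 8.

```python
def auc_roc(labels, scores):
    # Pairwise (Mann-Whitney) formulation of the area under the ROC curve.
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = 0.0
    for sp in pos:
        for sn in neg:
            if sp > sn:
                wins += 1.0
            elif sp == sn:
                wins += 0.5
    return wins / (len(pos) * len(neg))

labels = [0, 0, 1, 1]
informative = auc_roc(labels, [0.1, 0.4, 0.35, 0.8])   # three of four pairs ranked correctly
constant = auc_roc(labels, [0.0, 0.0, 0.0, 0.0])       # always predicting "negative"

# Accuracy of the always-negative baseline on labels with the data set's imbalance.
imbalanced = [0] * 595 + [1] * 405
majority_accuracy = sum(1 for y in imbalanced if y == 0) / len(imbalanced)

print(informative, constant, majority_accuracy)
```

A classifier that always outputs the negative class reaches about 0.6 accuracy on data with this imbalance but only 0.5 AUC-ROC, because it ranks no positive example above any negative one. The two measures are therefore not directly interchangeable.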

The images in Fig. 9 indicate not only that the supervising ensemble is capable of steering the autoencoder towards the features or regions of the images to retain in the latent representation, but also that these features influence the reconstructed images. This enables the classifier to make more accurate label predictions overall. In Fig. 10, the top row contains samples of the original, uncompressed images, annotated with their true and predicted labels. The middle row shows the same images after being compressed and reconstructed by the unsupervised AE, with their labels as predicted by the classifier. The bottom row contains the images (tagged with the 'Reconstructed FT' title) from the supervised autoencoder.

Fig. 9 Sample images to illustrate the effects of compression with the unsupervised and supervised AEs

Fig. 10 Examples of the images reconstructed by the selected autoencoder

To illustrate and assess the qualitative differences introduced by the compression process, sample images reconstructed by the autoencoders are provided in Fig. 10. The differences visible in these samples could indicate that the supervised autoencoder prioritizes the spatial qualities and structural elements of the images, which are more informative for the label predictions, over the overall colour. In the images reconstructed by the supervised AE, the checkerboard network artefacts seem to be more pronounced in the peripheral regions and around the edges of the images. This could indicate a positive adaptation to the task since, according to the data set description, only the center regions inform the image labels. Similarly, these artefacts seem to be more pronounced in the white/empty regions of the images reconstructed by the supervised AE than in those reconstructed by the unsupervised one. This further supports the previous observation that the supervised AE pays more attention to the regions important for the classification.

Based on the findings described above, it has been demonstrated that the proposed methodology of supervising a compressive autoencoder by embedding it in an ensemble with a trained classifier top is a viable approach for enriching the compressed latent representation with semantically meaningful information, and ultimately for improving the quality of the compressed images. Our idea supports the findings and extends the work of Le, Patterson and White [25] in their efforts to create a supervised autoencoder that can be utilized in the field of ROI-based compression, by adding a technique to the repertoire that can possibly be used in conjunction with existing methods. Furthermore, the fact that we focus specifically on WSI compression and use histopathological data to demonstrate the proposed method suggests that this might be a viable compression technique in CAD and CPATH applications, fields that are in dire need of such solutions due to their exceptionally high volume of data.

6 Conclusions and future work

The main contribution of this work to the field of ROI-based neural compression is a proof of concept, providing evidence for the viability of a new and simple supervision method for improving the performance of compressive neural networks. The proposed model prioritises the features that are more important for correctly classifying histopathology images when choosing what to retain in the compressed feature space. In future work, it would be advantageous to be able to objectively measure, as well as visualise, the structural differences between the images reconstructed by unsupervised and supervised CAEs.