Two layer Ensemble of Deep Learning Models for Medical Image Segmentation

In recent years, deep learning has rapidly become a method of choice for the segmentation of medical images. Deep Neural Network (DNN) architectures such as UNet have achieved state-of-the-art results on many medical datasets. To further improve the performance in the segmentation task, we develop an ensemble system which combines various deep learning architectures. We propose a two-layer ensemble of deep learning models for the segmentation of medical images. The prediction for each training image pixel made by each model in the first layer is used as the augmented data of the training image for the second layer of the ensemble. The prediction of the second layer is then combined by using a weights-based scheme in which each model contributes differently to the combined result. The weights are found by solving linear regression problems. Experiments conducted on two popular medical datasets namely CAMUS and Kvasir-SEG show that the proposed method achieves better results concerning two performance metrics (Dice Coefficient and Hausdorff distance) compared to some well-known benchmark algorithms.


I. INTRODUCTION
Segmentation is the process of partitioning an image into multiple segments to locate objects and boundaries. Before the rise of Deep Neural Networks (DNNs), most successful segmentation algorithms used hand-crafted features combined with a machine learning classifier such as Random Forest [1] or Support Vector Machine [2]. Even though subsequent research achieved noticeable improvements by incorporating richer context information [3] or by applying structured prediction techniques [4], [5], the performance of these systems remained limited because hand-crafted features are not representative enough for real-world usage. With the success of DNNs in image classification in 2012 [6], researchers began to apply this new architecture to segmentation. Notable results in this direction include Fully Convolutional Networks (FCN) [7] and SegNet [8]. Applying deep learning techniques to medical imaging has brought many successes, such as UNet, a novel architecture that was successfully applied to the segmentation of neuronal structures in electron microscopic stacks [9]. This network continues to be widely used for segmentation. Another notable example is [10], which used T1-weighted, T2-weighted, and fractional anisotropy image patches of size 13×13 as input to a Convolutional Neural Network (CNN) for the segmentation of infant brains, a task considered much more difficult than the segmentation of adult brains. This approach outperformed other commonly used segmentation algorithms when tested on a set of manually segmented isointense-stage brain images.
Deep learning methods are highly effective when the dataset is large. For example, one of the first major successes of deep learning was a network trained on the ImageNet Large Scale Visual Recognition Challenge (ILSVRC) dataset [6], which contained about 1 million annotated images. However, medical image datasets are much smaller, usually about 1,000 images [11]. This makes it challenging to create deep medical models that are robust against overfitting. Another problem is that popular optimizers for training deep neural networks, such as Stochastic Gradient Descent (SGD), generally require much manual tuning of optimization parameters [12]. Although there are alternative methods that require less parameter tuning, such as Adam [13], these methods do not generalize as well as SGD [14]. The need for manual parameter tuning makes it difficult to select suitable deep models for a specific problem. Because medical image analysis is critical in nature and requires reliable predictions from automated systems, it is essential to leverage the strong points of multiple segmentation algorithms to improve the final results.
Ensemble learning is a popular technique in which multiple learners are combined to make a collaborative decision. The key challenge is to build an effective ensemble method to combine the results of segmentation algorithms. The paper is organized as follows. In Section II, we briefly review the existing approaches relating to segmentation in medical image analysis and to ensemble learning. In Section III, we propose a novel two-layer ensemble method to combine the results of segmentation algorithms. Because segmentation gives a pixel-level output, the predictions of the first-layer segmentation algorithms are concatenated with the original image as input to the segmentation algorithms in the second layer. Dice coefficient and Hausdorff distance are used as the evaluation metrics. The details of the experimental studies on two public datasets are described in Section IV. Finally, the conclusion is given in Section V.

II. BACKGROUND AND RELATED WORK
A. Semantic segmentation in medical image analysis
With the success of [6] in applying a deep Convolutional Neural Network (CNN) to the problem of image classification, deep learning has become the most popular approach in computer vision. Since then, many notable deep architectures have been proposed to solve vision problems. For example, VGG16 [15] is a deep CNN for image classification that uses a stack of convolution layers with small receptive fields in the first layers, instead of a few layers with big receptive fields as in previous models. This allows the model to have far fewer parameters and more non-linearity, which makes the decision function more discriminative and the model easier to train. VGG16 achieved a top-5 accuracy of 92.7% on the ImageNet Large Scale Visual Recognition Challenge (ILSVRC)-2014 dataset. Another notable model is ResNet [16], which was motivated by the problem of training very deep architectures. The network uses shortcut connections to perform identity mapping, i.e. instead of learning the desired function directly, the layers with shortcut connections learn a residual mapping. This allows ResNet to be very deep, at 152 layers, while achieving 96.4% top-5 accuracy in the ILSVRC-2015 competition.
arXiv:2104.04809v1 [cs.CV] 10 Apr 2021
Generally, deep image classification models are trained on large datasets such as ImageNet [17], which has around 1 million images. However, in semantic segmentation, in which a model must predict the class of each pixel in the image, the available datasets are not as large as in image classification [18]. To overcome this limitation, practitioners usually take a pre-trained classification network and fine-tune it for segmentation. Most deep learning based semantic segmentation architectures are inspired by the Fully Convolutional Network (FCN) [7], which creates a segmentation network by taking an existing classification network and replacing the fully connected layers with convolutional ones, so that the network outputs spatial maps instead of classification scores. Those maps are then upsampled to produce dense pixel-level output. This architecture is considered the cornerstone of deep learning applied to semantic segmentation [18]. Another notable example is DeepLab [19], which uses Conditional Random Fields (CRF) [20] as a post-processing step to refine the segmentation result. The architecture models each pixel as a node in the random field and employs a fully connected factor graph in which one pairwise term is used for each pixel pair irrespective of their distance. This allows the model to take both short-range and long-range information into account, which helps to restore detailed structures that were lost due to the spatial invariance of the CNN.
Segmentation is considered one of the most essential medical imaging processes, as it extracts the region of interest (ROI) which is then used in clinical applications. It has therefore seen the widest variety of proposed methodologies, including deep architectures specifically designed to tackle problems in medical image analysis. A notable example is UNet [9], which consists of a contracting path and an expanding path designed symmetrically. To help with localization, high-resolution features from the contracting path are combined with the upsampled output. An important difference between UNet and previous architectures is that the upsampling part also has a large number of feature channels, which allows the network to propagate context information to higher-resolution layers. The network does not have any fully connected layer and can therefore be applied to images of arbitrary size via an overlap-tile strategy. In recent years, Recurrent Neural Networks (RNNs) have also become widely used for medical image segmentation. For example, in [21] a spatial clockwork RNN was used to segment perimysium in histopathology images. The authors applied the RNN four times in different orientations in order to incorporate bidirectional information from left/top and right/bottom neighbors. For 3D brain segmentation, [22] trained a 3D-CNN by using mini-batches of multiple cubes whose size was larger than the input size. Their model could take an arbitrary-sized 3D patch as input and output a block of predictions per input, which is similar to FCN. Over four different brain segmentation datasets, their method achieved the highest average specificity, with no significant loss in sensitivity. Some researchers have also used graphical models such as Conditional Random Fields as a post-processing step to refine the segmentation results [23].

B. Ensemble learning
Ensemble learning is a popular approach in machine learning that combines a collection of classifiers to make a collaborative decision. Designing an ensemble system requires two stages, namely ensemble generation and ensemble integration. In ensemble generation, multiple classifiers are generated by using either a homogeneous strategy (training one learning algorithm on multiple training sets generated from the original training data) [24], [25] or a heterogeneous strategy (training different learning algorithms on the original training data) [26], [27]. In the ensemble integration stage, a combining method is then used to aggregate the predictions of the constituent classifiers into the collaborative prediction. Several top-performing ensemble methods for classification have been reported, including Random Forest [28], XGBoost [29], and Rotation Forest [30].
Recently, there has been increasing interest in ensemble generation inspired by the success of DNNs. Instead of using only one layer as in traditional ensemble models, these ensemble systems are trained deeply through multiple layers. The first deep ensemble system, called gcForest, was proposed by Zhou and Feng [31] and contains multiple layers, each with two Completely-Random Tree Forests and two Random Forests. Each forest in a layer outputs a class vector, which is then concatenated to the original data to form the input to the next layer. Utkin et al. [32] proposed a weighted average approach for gcForest by associating each tree with a weight vector for its class distribution vector. The optimal weight vector of each tree in one layer is found by minimizing the distance between the class label vector in a binary encoding scheme and the weighted prediction vector of the forest. To reduce the computational overhead, the authors proposed to set only one weight vector for each group. Nguyen et al. [33] proposed MULES, a deep ensemble system with classifier and feature selection in each layer. The optimal configuration of each layer is found by solving a bi-objective optimization problem in which the two objectives to be maximized are the classification accuracy and the diversity of the ensemble in that layer. Qi et al. [34] introduced a deep ensemble model in which each layer consists of an ensemble of Support Vector Machine (SVM) classifiers [35]. The model parameters, such as the kernel functions of the SVM classifiers, the number of classifiers, and the weights of the features, are found by AdaBoost [36].

III. PROPOSED METHOD
Our proposed method is inspired by multi-layer ensemble learning architectures, in which the segmentation algorithms in one layer train their models on new training data generated by the preceding layer [31]. Applied to the segmentation of medical images, this facilitates the successive refinement of segmentation results through each layer. The most successful segmentation algorithms in recent years have been based on DNNs [37], and even though deep learning models can be trained in parallel on GPUs, a multi-layer ensemble of deep learning based segmentation algorithms would require a lot of computational resources. Therefore, an important question arises: how many layers should a deep ensemble model have? The authors of [33] showed that on some datasets, the number of layers obtained for a multi-layer ensemble was only 2 or 3. Based on this observation, we introduce a novel two-layer ensemble model for the segmentation of medical images. Figure 1 shows a high-level overview of our proposed method.

A. Two-layer ensemble for segmentation
Let $D = \{(I_n, Y_n)\}_{n=1}^{N}$ be the training set, where $N$ is the number of images and $I_n$ is an input image of size $(W, H, C)$, with $W$ the image width, $H$ the image height, and $C$ the number of channels ($C = 1$ for grayscale, $C = 3$ for color images). The mask $Y_n$ is also an image of size $(W, H)$, with each entry $Y_n(i, j)$ $(i = 1, \dots, W;\ j = 1, \dots, H)$ indicating which class the pixel $I_n(i, j)$ belongs to, i.e. $Y_n(i, j) \in \mathcal{Y}$, where $\mathcal{Y} = \{y_m\}_{m=1}^{M}$ is the set of all classes and $M$ is the number of classes.
We aim to learn a hypothesis $h : I_n \to Y_n$ (i.e. a segmentation model) that approximates the unknown relationship between each image and its corresponding mask, and then use this hypothesis to assign a label to each pixel of an unsegmented image. We denote by $\{\mathcal{K}_k\}_{k=1}^{K}$ the set of $K$ segmentation algorithms. Each segmentation algorithm $\mathcal{K}_k$ learns on $D$ to obtain a trained segmentation model $h_k$. In ensemble learning, we train the segmentation algorithms $\{\mathcal{K}_k\}_{k=1}^{K}$ on $D$ and combine the resulting models. In the next step, we generate the training data for the second layer of the ensemble. Based on the results of [33] and the stacked generalization model [26], we propose a two-layer deep ensemble architecture for segmentation in medical image analysis (Figure 1). Firstly, the training set $D$ is divided into $T$ disjoint parts $\{D_1, D_2, \dots, D_T\}$. Then, for each part $D_t$ $(t = 1, \dots, T)$, the segmentation algorithms $\{\mathcal{K}_k\}_{k=1}^{K}$ learn on its complement $D \setminus D_t$ to obtain segmentation models $h_{k,t}$. The images in $D_t$ are then segmented by using these models. Let $P_k(y_m \mid I_n(i, j))$ be the probability with which $h_{k,t}$ assigns pixel $I_n(i, j)$ to class $y_m$. The predictions of $h_{k,t}$ for class $y_m$ over all pixels of image $I_n$ form a $W \times H$ matrix $P_k(y_m \mid I_n)$ whose $(i, j)$ entry is $P_k(y_m \mid I_n(i, j))$. For each image $I_n$, there are $M \times K$ such prediction matrices, as illustrated in Figure 2.
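The T-fold procedure above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: `out_of_fold_predictions` and `fit_threshold` are hypothetical names, and the toy threshold learner merely stands in for a deep segmentation network so that the example runs without a deep learning framework.

```python
import numpy as np

def fit_threshold(train_images, train_masks):
    # Toy stand-in for a segmentation algorithm (assumption, not a real network):
    # it learns a single intensity threshold separating foreground from background.
    fg = train_images[..., 0][train_masks == 1]
    bg = train_images[..., 0][train_masks == 0]
    thr = (fg.mean() + bg.mean()) / 2

    def predict(image):
        p_fg = (image[..., 0] > thr).astype(float)
        return np.stack([1.0 - p_fg, p_fg])      # shape (M=2, W, H)

    return predict

def out_of_fold_predictions(images, masks, fit_fns, n_folds=5, n_classes=2):
    """Return probability maps of shape (N, K, M, W, H), each made by a model
    that never saw the corresponding image during training."""
    n = len(images)
    w, h = images.shape[1:3]
    preds = np.zeros((n, len(fit_fns), n_classes, w, h))
    folds = np.array_split(np.arange(n), n_folds)           # parts D_1, ..., D_T
    for held_out in folds:
        train_idx = np.setdiff1d(np.arange(n), held_out)    # D \ D_t
        for k, fit in enumerate(fit_fns):
            predict = fit(images[train_idx], masks[train_idx])   # model h_{k,t}
            for i in held_out:
                preds[i, k] = predict(images[i])
    return preds

rng = np.random.default_rng(0)
images = rng.random((10, 8, 6, 1))                 # N=10 grayscale images, W=8, H=6
masks = (images[..., 0] > 0.5).astype(int)         # synthetic ground truth
oof = out_of_fold_predictions(images, masks, [fit_threshold])
print(oof.shape)
```

Because every image is predicted only by models trained on the other folds, the resulting probability maps behave like predictions on unseen data, which is what the second layer will receive at test time.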
In this study, we propose to augment the training data for the second layer of the ensemble by concatenating these $M \times K$ prediction matrices to the original training images to create new images $I^*_n$. Each prediction matrix $P_k(y_m \mid I_n)$ serves as an additional channel of the original image $I_n$, so the new images $I^*_n$ have $C + M \times K$ channels in total. The new training data for the second layer of the ensemble is $D^* = \{(I^*_n, Y_n)\}_{n=1}^{N}$. For the second layer of the ensemble, we train $\{\mathcal{K}_k\}_{k=1}^{K}$ on $D^*$ to get trained segmentation models $\{h^*_k\}_{k=1}^{K}$. We then need to train a combiner $\mathcal{C}$ to combine the trained models for final decision making. The combiner is trained on the predictions for all pixels of the training images in $D^*$. Once again, the new training data $D^*$ is divided into $T$ disjoint parts $\{D^*_1, D^*_2, \dots, D^*_T\}$. Then, for each part $D^*_t$ $(t = 1, \dots, T)$, the segmentation algorithms learn on $D^* \setminus D^*_t$ to obtain segmentation models $h^*_{k,t}$. These models then predict on $D^*_t$. The second-layer probability predictions for all images in $D^*$ are collected in a matrix $L^*$. Normally, a learning algorithm trains the combiner on $L^*$, together with the given label of each pixel, to combine the predictions of the segmentation models into the final prediction. Each row of $L^*$ holds the probability predictions of the $K$ segmentation models for one pixel of one training image. Therefore $L^*$ is a matrix with $N \times W \times H$ rows and $M \times K$ columns. With a large training set and large image sizes, $L^*$ becomes very large. For example, on the Kvasir-SEG dataset with 800 training images of size (640, 544), the matrix $L^*$ has $800 \times 640 \times 544$ = 278,528,000 rows. Such a size makes it challenging for conventional machine learning algorithms to train the combiner on all data at once. In this paper, we use a weight-based combining method on the segmentation models $\{h^*_k\}_{k=1}^{K}$, in which each segmentation model has its own weight in the combiner. The weights are found via an optimization method. This approach makes it practical to train the combiner on the whole of $L^*$ at once.
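As a small illustration of the channel augmentation and of the row count quoted above, the following sketch stacks the probability maps behind the original channels. The function name `augment_image`, the tiny image size, and the choice of K = 6 models and M = 2 classes are assumptions made only for the example.

```python
import numpy as np

def augment_image(image, prob_maps):
    """Concatenate an image of shape (W, H, C) with prediction matrices of
    shape (K, M, W, H) to obtain an image with C + M*K channels."""
    k, m, w, h = prob_maps.shape
    extra = prob_maps.reshape(k * m, w, h).transpose(1, 2, 0)   # (W, H, K*M)
    return np.concatenate([image, extra], axis=-1)              # (W, H, C + K*M)

K, M, W, H, C = 6, 2, 8, 6, 3                 # illustrative sizes (assumed)
image = np.zeros((W, H, C))
prob_maps = np.arange(K * M * W * H, dtype=float).reshape(K, M, W, H)
augmented = augment_image(image, prob_maps)
print(augmented.shape)                        # (8, 6, 15)

# Number of rows of L* for 800 training images of size (640, 544).
rows = 800 * 640 * 544
print(rows)                                   # 278528000
```

In this layout, channel `C + k * M + m` of the augmented image holds the map of model `k` for class `m`; any fixed ordering would work as long as training and testing use the same one.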

B. Combining method
Let $\mathbf{W} = \{w_{k,m}\}$ be the weight matrix, in which $w_{k,m}$ is the weight associated with the segmentation model $h^*_k$ and class $y_m$ $(k = 1, \dots, K;\ m = 1, \dots, M)$. Since the class labels of the training observations are known in advance, the weights $\mathbf{W}$ can be obtained by exploring the relationship between the second-layer probability predictions in $L^*$ and the class labels of the training pixels. The weight matrix is found by minimizing the difference between the prediction for pixel $I_n(i, j)$ and its true class label. From the second-layer probability prediction matrix $L^*$, we extract the probabilities associated with class $y_m$ to create a matrix $L^*_m$ of size $(N \times W \times H, K)$, in which each row holds the $K$ second-layer probabilities of class $y_m$ for one training pixel. We also define a crisp label vector of size $(N \times W \times H, 1)$ associated with class $y_m$, whose entries are $\mathbb{I}[Y_n(i, j) = y_m]$, where $\mathbb{I}[\cdot]$ is the indicator function. The weight vector $\mathbf{W}_m = \{w_{k,m}\}$, $k = 1, \dots, K$, of size $(K, 1)$ for class $y_m$ is then found by solving a linear regression problem that fits $L^*_m \mathbf{W}_m$ to this label vector in the least-squares sense. Different constraints can be imposed on $\mathbf{W}_m$, such as Non-Negative Least Squares, i.e. $w_{k,m} \ge 0$ [38], [39]; Bounded Variable Least Squares, i.e. $l_{k,m} \le w_{k,m} \le u_{k,m}$, in which $l_{k,m}$ and $u_{k,m}$ are lower and upper bounds, respectively [40]; and Bounded Variable with Constant Sum, i.e. $-1 < w_{k,m} < 1$, $\sum_{k=1}^{K} w_{k,m} = 1$ [41]. In this study we simply constrain the weights to lie between 0 and 1, i.e. $0 \le w_{k,m} \le 1$. By solving $M$ different linear regression problems, we get the optimal weight matrix $\mathbf{W} = \{\mathbf{W}_m\}_{m=1}^{M}$. Given an unsegmented image $I_{test}$, it is first segmented by $\{h_k\}_{k=1}^{K}$ to get the prediction matrices $\{P_k(y_m \mid I_{test})\}$ $(k = 1, \dots, K;\ m = 1, \dots, M)$. The augmented data $I^*_{test}$ is then created by concatenating $I_{test}$ with $\{P_k(y_m \mid I_{test})\}$ as additional image channels.
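The trained second-layer segmentation models $\{h^*_k\}_{k=1}^{K}$ are then applied to $I^*_{test}$ to get the prediction matrices $\{P_k(y_m \mid I^*_{test})\}$ $(k = 1, \dots, K;\ m = 1, \dots, M)$. The class membership of a pixel $I^*_{test}(i, j)$ for class $y_m$ is the linear combination $\sum_{k=1}^{K} w_{k,m} P_k(y_m \mid I^*_{test}(i, j))$, i.e. the product of $P_m(I^*_{test}(i, j))$ and $\mathbf{W}_m$, in which $P_m(I^*_{test}(i, j))$ is the vector of the $K$ second-layer probabilities for class $y_m$ at that pixel and $\mathbf{W}_m$ is the weight vector for class $y_m$. Finally, the predicted class label is the label corresponding to the maximum value of the class memberships. The combining and training procedure is described in Algorithm 1. Algorithm 1 receives as inputs the training set $D = \{(I_n, Y_n)\}_{n=1}^{N}$ and the segmentation algorithms $\{\mathcal{K}_k\}_{k=1}^{K}$. Lines 2-7 create the probability matrices via the $T$-fold cross-validation procedure. Line 8 creates the augmented input data for the second layer via Equation 2. Lines 10-14 create the second-level predictions $L^*$ for all training pixels via the $T$-fold cross-validation procedure. Lines 16-20 find the optimal weight matrix via Equation 7. Lines 21-24 train the segmentation models on the original training data and on the augmented data, respectively. Line 25 returns the trained models and the optimal weight matrix.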
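The bounded least-squares fit of one class's weight vector can be sketched as follows. This is only an illustration: projected gradient descent is used here as a simple stand-in solver (the experiments report using the sklearn library), `fit_class_weights` is a hypothetical name, and the target vector in the demo is synthetic rather than a crisp label vector, so that the recovered weights can be checked exactly.

```python
import numpy as np

def fit_class_weights(L_m, y_m, n_iter=2000):
    """Minimise ||L_m w - y_m||^2 subject to 0 <= w <= 1.

    L_m: (n_pixels, K) second-layer probabilities for one class.
    y_m: (n_pixels,) target vector for that class.
    """
    gram = L_m.T @ L_m                 # (K, K): small even if L_m has many rows
    rhs = L_m.T @ y_m                  # (K,)
    step = 1.0 / np.linalg.eigvalsh(gram).max()   # 1 / Lipschitz constant
    w = np.full(L_m.shape[1], 1.0 / L_m.shape[1])
    for _ in range(n_iter):
        grad = gram @ w - rhs          # gradient of 0.5 * ||L_m w - y_m||^2
        w = np.clip(w - step * grad, 0.0, 1.0)    # project onto the box [0, 1]^K
    return w

rng = np.random.default_rng(0)
L_m = rng.random((500, 3))             # 500 pixels, K = 3 models (synthetic)
true_w = np.array([0.2, 0.5, 0.3])
y_m = L_m @ true_w                     # noise-free synthetic target
w_hat = fit_class_weights(L_m, y_m)
print(np.round(w_hat, 3))
```

One design point worth noting: the solver only touches the $K \times K$ matrix `gram` and the length-$K$ vector `rhs` inside the loop, and both can be accumulated over chunks of rows. This is one plausible reason why a regression over hundreds of millions of rows remains cheap when $K$ is small.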
The testing procedure takes as inputs an image $I_{test}$, the trained models and the optimal weight matrix (see Algorithm 2). Lines 1-2 create the probability matrices, while in line 3 the augmented input to the second layer is created by using Equation 8. Lines 4-5 create the second-level probability matrices from the augmented input. Lines 6-7 use Equations 9, 12 and 13 to combine the second-level predictions of the segmentation models by using the optimal weight matrix $\mathbf{W}$. Finally, line 8 returns the final segmentation result.
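The final combination step at test time, a weighted sum over models followed by a per-pixel arg max, can be sketched as below. The function name `combine_predictions`, the array layout and the numbers are assumptions chosen for a hand-checkable example.

```python
import numpy as np

def combine_predictions(probs, weights):
    """probs: (K, M, W, H) second-layer probabilities for one test image.
    weights: (K, M) weight matrix with entries w[k, m].
    Returns a (W, H) array of predicted class indices."""
    # Class membership for class m at each pixel: sum_k w[k, m] * probs[k, m].
    memberships = np.einsum("km,kmwh->mwh", weights, probs)
    return memberships.argmax(axis=0)

# Two models, two classes, one image of size (1, 2).
probs = np.array([
    [[[0.9, 0.4]], [[0.1, 0.6]]],      # model 0: class 0 map, class 1 map
    [[[0.2, 0.3]], [[0.8, 0.7]]],      # model 1: class 0 map, class 1 map
])
weights = np.array([[0.75, 0.75],      # model 0 weights for classes 0 and 1
                    [0.25, 0.25]])     # model 1 weights for classes 0 and 1
labels = combine_predictions(probs, weights)
print(labels)                          # [[0 1]]
```

For the first pixel the memberships are 0.725 (class 0) and 0.275 (class 1); for the second they are 0.375 and 0.625, so the more heavily weighted model decides both pixels.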

IV. EXPERIMENTAL STUDIES
In this experiment, we used UNet [9], LinkNet [42] and Feature Pyramid Network (FPN) [43], three popular segmentation architectures. The backbones used were VGG16 [15] and ResNet34 [16], pretrained on the ImageNet dataset [17]. In total, 6 segmentation models (3 architectures with 2 backbones each) were used in the experiments. All segmentation algorithms were trained for 300 epochs. The number of folds in the cross-validation procedure was set to 5. We compared the performance of the proposed ensemble to the 6 segmentation algorithms and to a one-layer ensemble system with the weights-based combiner, denoted by OLE in the tables.

A. Performance metrics
The performance of our proposed method and the related benchmarks was evaluated using two popular segmentation metrics. Suppose there are $M$ classes and $N$ images, each of size $(W, H)$. Let $P = \{p_m\}_{m=1}^{M}$ and $G = \{g_m\}_{m=1}^{M}$ be the prediction of a segmentation model on these images and the corresponding ground truth, where $p_m$ is a vector of size $(N \times W \times H, 1)$ associated with class label $y_m$ whose elements are the predictions for each pixel in the form of crisp labels, i.e. in $\{0, 1\}$. Likewise, $g_m$ is a vector of size $(N \times W \times H, 1)$ associated with class label $y_m$ whose elements are the ground truth of each pixel in the form of crisp labels, i.e. in $\{0, 1\}$. The Dice coefficient for the $m$-th class [44] is twice the number of pixels on which $p_m$ and $g_m$ are both 1, divided by the total number of nonzero entries in $p_m$ and $g_m$. In the context of medical image analysis, local discrepancies between contours are often of interest as well. For example, radiation treatment planning applications require quantified errors in geometric displacement to ensure target coverage, normal tissue avoidance, and similar analyses [45]. We therefore also report one measure based on the distance between geometrical contours. Let $GT_m$ and $PR_m$ be the sets of coordinate vectors of the ground truth contour and the prediction contour for class $y_m$, respectively. The Hausdorff distance $HD$ associated with class $y_m$ [46] is the larger of $d(GT_m, PR_m)$ and $d(PR_m, GT_m)$, where $d(A, B)$ is the directed Hausdorff distance, i.e. the largest distance from a point in $A$ to its nearest point in $B$. A low Hausdorff distance or a high Dice coefficient indicates a good segmentation result.
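Both metrics can be sketched with NumPy using their standard definitions. The function names are hypothetical, the convention of returning 1.0 when both masks are empty is an assumption, and the brute-force pairwise distance matrix is only suitable for small contours.

```python
import numpy as np

def dice_coefficient(pred, truth):
    """Dice = 2 |P and G| / (|P| + |G|) for binary arrays of equal shape."""
    pred = np.asarray(pred, dtype=bool)
    truth = np.asarray(truth, dtype=bool)
    denom = pred.sum() + truth.sum()
    if denom == 0:
        return 1.0      # assumed convention: two empty masks agree perfectly
    return 2.0 * np.logical_and(pred, truth).sum() / denom

def directed_hausdorff(a, b):
    """Largest distance from a point in a to its nearest point in b."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    dists = np.linalg.norm(a[:, None, :] - b[None, :, :], axis=-1)   # (|a|, |b|)
    return dists.min(axis=1).max()

def hausdorff(a, b):
    """Symmetric Hausdorff distance between two point sets."""
    return max(directed_hausdorff(a, b), directed_hausdorff(b, a))

dice = dice_coefficient([1, 1, 0, 0], [1, 0, 0, 0])
print(round(dice, 4))                   # 0.6667

gt_contour = [(0, 0), (0, 1)]
pr_contour = [(0, 0), (0, 4)]
print(directed_hausdorff(gt_contour, pr_contour))   # 1.0
print(directed_hausdorff(pr_contour, gt_contour))   # 3.0
print(hausdorff(gt_contour, pr_contour))            # 3.0
```

The example also shows why the symmetric form takes the maximum of both directions: the two directed distances differ (1.0 versus 3.0), and only the larger one reflects the stray prediction point at (0, 4).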

B. Kvasir-SEG dataset
The first dataset used in this paper is Kvasir-SEG [47], which consists of 1000 gastrointestinal polyp images, 200 of which are used for testing. The task is to segment the polyps in the images. A comparative evaluation of the segmentation models and the proposed method in terms of Dice coefficient and Hausdorff distance is shown in Table I. The methods with VGG16 as backbone perform poorly, with a Dice coefficient of just 0.0. In contrast, UNet-ResNet34, LinkNet-ResNet34 and FPN-ResNet34 achieve Dice coefficients of 0.878, 0.879 and 0.887 respectively, while OLE achieves 0.888, which is roughly the same as FPN-ResNet34. The proposed method achieves a score of 0.892, an increase of 0.004 over the second best (OLE). For the Hausdorff distance, LinkNet-VGG16 has a very high score of around 271.7, while UNet-VGG16 achieves a score of 10.402 and FPN-VGG16 has a score of 0.0 (it detects nothing). Among the methods using the ResNet34 backbone, UNet-ResNet34 has the highest Hausdorff distance at 55.591, followed by LinkNet-ResNet34 at 51.241, FPN-ResNet34 at 50.321 and OLE at 49.38. The proposed method achieves a Hausdorff distance of 48.831, which is better than OLE by about 0.55.
Figure 3 shows the results of the segmentation models, OLE and the proposed ensemble, together with the ground-truth mask and the original test image. The results of the methods using the VGG16 backbone are not shown because they could not predict anything. All the segmentation algorithms correctly segmented the left part of the polyp. For the right part, however, UNet-ResNet34 and FPN-ResNet34 left a big hole in the lower and upper part respectively, while LinkNet-ResNet34 and OLE failed to segment the right part. The proposed ensemble correctly segmented both the left and the right part of the polyp, with the exception of a relatively small hole in the middle. The reason for the better performance of the proposed ensemble is that, in generating the segmentation models, it takes into consideration information not only from the input image but also from the first-layer predictions.
The proposed ensemble has a higher training time than the benchmark algorithms. Compared with OLE, which took about 2 days to train on this dataset, our two-layer ensemble trained for 4 days. In our training process, we solved Equation 7 to find the combining weights. Even though the optimisation problem in Equation 7 operates on the matrix $L^*_m$ with 278,528,000 rows, it took only 5 minutes to find the weights by using the sklearn library, which was the same as with OLE. Meanwhile, the testing time of the proposed ensemble for 200 test images was 11 seconds, while OLE took 7 seconds.

C. CAMUS dataset
The second dataset used in this paper is the Cardiac Acquisitions for Multi-structure Ultrasound Segmentation (CAMUS) dataset [48], which is a dataset provided by a competition for accurate segmentation of 2D echocardiographic images. Tables II and III show the results of the segmentation models and the proposed ensemble. We included the authors' best results for each measure on this dataset [48]. With respect to the Dice measure, the proposed method achieved the best or joint-best result in all cases. For the end-diastolic (ED) case, the proposed method achieved the best result on the Myocardium and Left atrium classes at 0.96 and 0.907, compared to the second-best results of 0.959 and 0.9 by OLE, respectively. On the Left ventricle class, the proposed method achieved the same result as the second best at 0.946. For the end-systolic (ES) case, the proposed ensemble achieved roughly the same result as OLE on the Left ventricle and Myocardium classes at 0.93 and 0.955 respectively. On the Left atrium class, however, the proposed method achieved a score of 0.934, which is better than the second best (OLE) at 0.929. The segmentation algorithms with the VGG16 backbone performed very poorly in all cases, with scores ranging only from 0.2 (LinkNet-VGG16 on Myocardium) to 0.307 (UNet-VGG16 on Left ventricle).
With the Hausdorff distance, the proposed ensemble beats the segmentation models in all classes for the ES case. It achieved 4.4 on the Left ventricle class, while the second best among the segmentation models (LinkNet-ResNet34) achieved only 4.7 and OLE achieved 4.6. The same observation holds for the Myocardium and Left atrium classes. For the ED case, however, the proposed ensemble performed worse than LinkNet-VGG16, for example in the Myocardium class, where the proposed method achieved a score of 5 while LinkNet-VGG16 achieved 3.8, which is better by 1.2. This can be explained by the observation in [45] that the Hausdorff distance can be miscalculated when the curvature has a high degree of winding and low similarity. Figure 4 shows an example in which the proposed ensemble improved on the results of the segmentation models. While the predictions by the methods using the VGG16 backbone (first row) contain a number of deformations compared to the test image, the predictions in the second row, which use the ResNet34 backbone, give better results. LinkNet-ResNet34 and FPN-ResNet34 failed to predict a large region in the bottom right of the Left atrium (second row, second and third columns). The prediction by UNet-ResNet34 is better than those of LinkNet-ResNet34 and FPN-ResNet34, but it nevertheless contains a sharp inward region that was not correctly segmented. The proposed ensemble improved upon the predictions of the constituent segmentation models, as its prediction segments the bottom right part correctly overall.
2 https://www.creatis.insa-lyon.fr/Challenge/camus/scientificInterests.html

V. CONCLUSION
In this paper, we presented a two-layer ensemble of deep learning models for the segmentation of medical images. The key idea is to use the probability predictions of the constituent models in the first layer as augmented data for the second layer. The output probability predictions of the second layer are combined by using a weight-based scheme, which is not only an effective combiner but also computationally efficient. The weights are found by solving a linear regression problem associated with each class label. Our results on two benchmark datasets show that the proposed ensemble method is able to combine the strengths and mitigate the drawbacks of the constituent segmentation methods, resulting in an overall improvement.
for final decision making. The combiner is trained on the predictions for all pixels of the training images in D * . Once again, the new training data D * is divided into disjoint parts {D * 1 , D * 2 , ..., D * T }. Then for each part D * t (t = 1, ..., T ), the segmentation algorithms {K k } K k=1 will learn on D * \ D *

Fig. 2. Example of prediction results on the CAMUS dataset. Top: original image. Bottom: predictions for the Left ventricle, Myocardium and Left atrium classes, made by UNet and LinkNet with backbones ResNet34 and VGG16, respectively. The results have been multiplied by 255 for visualization.
The trained segmentation models of the second layer {h * k } K k=1 are then applied to I * test to get the prediction matrices {P k (y m |I * test )} (k = 1, ..., K, m = 1, ..., M ). The class memberships of an image pixel I * test (i, j) are found via a linear combination of the prediction probabilities and the associated weights as:

Fig. 3. Example result for the Kvasir-SEG dataset. From left to right, top to bottom: UNet-ResNet34, LinkNet-ResNet34, FPN-ResNet34, OLE, proposed method, ground truth mask, and test image. The results of the segmentation algorithms using the VGG16 backbone are not shown because they were not able to detect the polyps.

TABLE I
KVASIR-SEG RESULT FOR DICE AND HAUSDORFF MEASURE

TABLE II
RESULT FOR CAMUS DATASET, DICE MEASURE
Columns: End Diastolic (Left ventricle, Myocardium, Left atrium); End Systolic (Left ventricle, Myocardium, Left atrium)

TABLE III
RESULT FOR CAMUS DATASET, HAUSDORFF MEASURE
Columns: End Diastolic (Left ventricle, Myocardium, Left atrium); End Systolic (Left ventricle, Myocardium, Left atrium)