Generative multi-adversarial network for striking the right balance in abdominal image segmentation

Purpose The identification of abnormalities that are relatively rare within otherwise normal anatomy is a major challenge for deep learning in the semantic segmentation of medical images. The small number of samples of the minority classes in the training data makes the learning of optimal classification challenging, while the more frequently occurring samples of the majority class hamper the generalization of the classification boundary between infrequently occurring target objects and classes. In this paper, we developed a novel generative multi-adversarial network, called Ensemble-GAN, for mitigating this class imbalance problem in the semantic segmentation of abdominal images. Method The Ensemble-GAN framework is composed of a single generator and multiple discriminators for handling the class imbalance problem to provide a better generalization than existing approaches. The ensemble model aggregates the estimates of multiple models by training from different initializations and losses from various subsets of the training data. The single generator network analyzes the input image as a condition to predict a corresponding semantic segmentation image by use of feedback from the ensemble of discriminator networks. To evaluate the framework, we trained it on two public datasets with different imbalance ratios and imaging modalities: CHAOS 2019 and LiTS 2017. Result In terms of the F1 score, the accuracies of the semantic segmentation of healthy spleen, liver, and left and right kidneys were 0.93, 0.96, 0.90 and 0.94, respectively. The overall F1 scores for simultaneous segmentation of the lesions and liver were 0.83 and 0.94, respectively. Conclusion The proposed Ensemble-GAN framework demonstrated outstanding performance in the semantic segmentation of medical images in comparison with other approaches on popular abdominal imaging benchmarks.
These results suggest that the Ensemble-GAN has the potential to support accurate segmentation of abdominal images.


Introduction
One of the major challenges of deep learning for medical image analysis is the highly skewed class distribution of objects in medical images, which is referred to as the imbalanced classification problem. An imbalanced classification problem occurs when the target classes of a dataset have a highly unequal number of samples. For example, in a binary classification, the problem occurs when the class representing a specific disease has far fewer samples than the healthy class. The former is called an infrequent class or minority class, whereas the latter is called a majority class. Because canonical machine learning assumes that different categories have similar numbers of samples, a model trained on such an imbalanced data distribution will be biased toward the most frequent class, which is not desirable in clinical applications.
In this work, we mitigate the negative impact of the class imbalance problem through ensemble learning of discriminative convolutional neural networks. By combining multiple networks that are individually complementary, one can obtain a compound classifier that is more accurate than any of its base components [1]. Here, we propose an architecture based on a generative multi-adversarial network, called Ensemble-GAN, which is composed of a generator and an ensemble of discriminators. We implemented the generator network in a multi-discriminator setting through simultaneous minimization of different losses to minimize the prediction error of the generator model as a multi-objective optimization problem. The discriminators were varied by use of different feature maps, different losses, and initializations. Moreover, we developed methods for providing more accurate semantic segmentation of high-resolution medical images than existing approaches.
To demonstrate the generalization ability of our approach, we evaluated the performance of the Ensemble-GAN in semantic segmentation of organs and tumor regions from abdominal computed tomography (CT) and magnetic resonance (MR) images by use of a highly imbalanced training dataset where the number of pixels belonging to abnormal regions of interest was much smaller than that of normal regions. The results demonstrated the generalization ability of our approach in the segmentation of body organs and tumor regions.
The rest of the paper is organized as follows: "Related work" section presents an overview of the most recent approaches to the imbalanced classification problem and semantic segmentation of medical images. "Method" section explains the proposed approach for learning the class imbalance problem. The experimental design and results are presented in "Experimental design" and "Results" sections, respectively, followed by the discussions and conclusions in "Discussions and conclusions" section.

Related work
This section provides a brief review of the most recent state-of-the-art approaches to learning from imbalanced data, multi-objective training of generative adversarial networks (GANs), and medical image segmentation.

Learning from imbalanced data
In medical image analysis, the most popular strategies for addressing the imbalanced classification problem have included data-level methods and algorithmic methods. The data-level methods include under-sampling or over-sampling of the training dataset. However, these resampling approaches often remove some of the important samples or they add redundant samples to the training data. Algorithmic methods have included cost-sensitive learning and ensemble learning. The cost-sensitive learning is typically used with accuracy loss [2], Dice coefficient loss [3], and asymmetric similarity loss [4] to modify the distribution of the training data based on a mis-classification cost. However, in the case of image segmentation, losses such as mean surface distance or Hausdorff surface distance are more appropriate. Most of the imbalanced ensemble techniques apply majority voting [5] or average voting [6] with a combination of losses and different initializations. The trade-off from the bias and variance of combining a redundant ensemble was studied by Sun et al. [7]. Because the ensemble model reduced the variance on test data, the prediction result for the minority class was improved [7].
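To illustrate why an overlap-based loss is often preferred to plain accuracy on imbalanced masks, the following is a minimal sketch of a soft Dice loss for a single class. The function name `soft_dice_loss`, the smoothing constant `eps`, and the toy mask are assumptions made for this illustration; they are not taken from the cited works.

```python
import numpy as np


def soft_dice_loss(probs, target, eps=1e-6):
    """Soft Dice loss for one class: 1 - 2*sum(P*T) / (sum(P) + sum(T))."""
    probs = np.asarray(probs, dtype=float)
    target = np.asarray(target, dtype=float)
    intersection = np.sum(probs * target)
    denominator = np.sum(probs) + np.sum(target)
    return 1.0 - (2.0 * intersection + eps) / (denominator + eps)


# Toy 4x4 mask in which only 2 of 16 pixels belong to the minority class.
target = np.zeros((4, 4))
target[1, 1] = target[1, 2] = 1.0

all_background = np.zeros((4, 4))  # predicts the majority class everywhere
perfect = target.copy()

# Pixel accuracy looks good even though the minority class is missed entirely.
accuracy = np.mean(all_background == target)
print("accuracy of all-background prediction:", accuracy)
print("Dice loss, all background:", soft_dice_loss(all_background, target))
print("Dice loss, perfect overlap:", soft_dice_loss(perfect, target))
```

In this toy case the all-background prediction is correct on 14 of 16 pixels, yet its Dice loss is close to 1, which is the behavior that makes the loss sensitive to the minority class.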

Multi-objective training of GANs
Recently, variants of GAN models have included multiple generators and/or multiple discriminators to tackle the problems of mode collapse, global optimization, and nonconvergence of conventional GANs. Durugkar et al. [8] introduced a generator with multiple discriminators, where the average or maximum of discriminator losses provides feedback to the generator. In another study, a generator was trained with a set of discriminators where each discriminator classified a fixed random projection of the inputs [6,9]. In contrast, the MGAN [10] and MAD-GAN [11] schemes proposed GAN-based architectures with multiple generators and single discriminator, while the MD-GAN [12] introduced a distributed GAN composed of four generators and four discriminators. Sathish et al. [13] (IITKGP-KLIV) performed adversarial learning composed of two auxiliary classifiers and one discriminator with application to medical image segmentation.

Semantic image segmentation
Recent studies on deep learning for semantic segmentation of images have differed mostly in terms of their architectural design for linking different parts of the image to reveal relationships between the objects. Examples include the DeepLabv3+ [14] framework, which used an encoder-decoder structure with the separable atrous convolution composed of a depth-wise convolution (spatial convolution for each channel of the input) and a point-wise convolution (1×1 convolution with the depth-wise convolution as input). Pham et al. [13] (ISUDE) proposed an hourglass autoencoder with Dice loss for abdominal segmentation, and a modified U-Net architecture augmented with an attention mechanism [13] (OvGUMEMoRIAL) showed successful results for semantic segmentation of abdominal images.

Conditional GAN
In a conventional GAN, a generative model G learns a mapping from a random noise vector z to an output image y; $G : z \rightarrow y$. Meanwhile, a discriminative model D estimates the probability of a sample coming from the training data ($x_{real}$) rather than from the generator ($x_{fake}$). The objective function is a two-player mini-max game that can be formulated as
$$\min_G \max_D V(G, D) = \mathbb{E}_{x \sim p_{data}(x)}[\log D(x)] + \mathbb{E}_{z \sim p_z(z)}[\log(1 - D(G(z)))]$$
In a conditional GAN (cGAN), a generative model learns a mapping from the random noise vector z and an observed image x to an output image y; $G : x, z \rightarrow y$. The discriminative model attempts to discriminate between the ground truth of the training set and the generator output as in a conventional GAN. The objective function conditions both G and D on the desired output y:
$$\min_G \max_D V(G, D) = \mathbb{E}_{x,y \sim p_{x,y}}[\log D(x, y)] + \mathbb{E}_{z \sim p_z, y \sim p_y}[\log(1 - D(G(z, y); y))]$$
In both conventional and conditional GAN frameworks, the task of a discriminator is much harder than that of the generator as it has to minimize the mistakes of the generator. Along with the mini-max nature of the objective, this raises several challenges such as mode collapse, vanishing gradient, and failure to converge.
In this work, we propose a new framework to address the learning of an unbiased model on the class imbalance problem. Figure 1 illustrates the architecture of our Ensemble-GAN where all components are parameterized by neural networks. The proposed framework consists of single-generator and multi-discriminator variants that attempt to better approximate $\max V(G, D_k)$, providing a better critic to the generator. Here, the generator learns from the feedback aggregated over multiple discriminators, for example by the sum $\sum_{k=1}^{K} V(G, D_k)$.
The main idea of combining multiple discriminators in redundant ensembles is (1) to improve the generalization ability since each discriminator covers only some parts of the application data, (2) to combine multiple discriminators into a single consensus model (as maximum, average, or sum), which performs better than a single discriminator because the patterns that are misclassified by different discriminators are not the same, and (3) to overcome typical defects of vanilla conditional GANs, such as global structure collapse and local detail ambiguity by designing a new architecture for the generator.
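The consensus step in point (2) can be sketched as a small aggregation function. This is only an illustration under stated assumptions: the function name `aggregate_feedback` and the loss values are hypothetical, and the sketch shows the three aggregation rules named in the text (maximum, average, sum) rather than the authors' implementation.

```python
import numpy as np


def aggregate_feedback(losses, mode="sum"):
    """Combine per-discriminator losses into one scalar for the generator."""
    losses = np.asarray(losses, dtype=float)
    if mode == "sum":
        return float(losses.sum())
    if mode == "average":
        return float(losses.mean())
    if mode == "maximum":
        return float(losses.max())
    raise ValueError(f"unknown mode: {mode}")


# Hypothetical loss values reported by K = 3 discriminators for one batch.
batch_losses = [0.5, 0.25, 1.0]
for mode in ("sum", "average", "maximum"):
    print(mode, aggregate_feedback(batch_losses, mode))
```

The maximum rule passes on only the harshest critic, whereas the sum and the average let every discriminator contribute to the generator update.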

Ensemble-GAN
In our workflow, the generator G is forced to learn to minimize the prediction error of semantic segmentation through the ensemble of discriminators. This ultimately encourages G to produce conditional samples with minimum error, since G needs to fool the different possible discriminators. Variations in the ensemble are achieved by the summation feedback of each D with a certain probability at the end of every batch. Therefore, G considers the sum of discriminator losses in the ensemble while updating its parameters at each iteration.
Similar to Luo et al. [15], the local and global outputs extracted by a single generator are passed into two individual discriminators. We designed and implemented different architectures with various losses based on our study. Increasing the number of discriminators has two benefits: (1) with different losses, the discriminators cover more aspects of the generator's output, and (2) with different representations of the data, they are also capable of better capturing the distributions of the generator.

The objective function
We formulate the proposed Ensemble-GAN with a cohort of three networks (see Fig. 1). Extension of the framework with more networks is discussed in "Experiments" section. Here, a single generator attempts to minimize the segmentation error with respect to an ensemble of k different losses. The generator takes a random vector z and medical images x as input, whereas the discriminators attempt to minimize the error of predicting the segmentation masks produced by the generator through multiple losses. For a fixed G, the function F passes the sum of the discriminator losses, $\sum_{k=1}^{K} V(G, D_k)$, to the generator. The use of the proposed combination scheme prioritizes the worst discriminators and thus provides more useful gradients to the generator during the training. Details about the architectural choices, discriminator and generator losses, and the selection of the hyperparameter λ are discussed in "Experiments" section.
    ...
        Sample a mini-batch of noise samples z from the noise prior $p_g(z)$
        Sample from the data-generating distribution $p_{data}(x, y)$
        Update and fine-tune the sum of the k discriminators as follows:
        Maximize $\mathbb{E}_{x,y \sim p_{x,y}}[\log D_k(x, y)] + \mathbb{E}_{z \sim p_z, y \sim p_y}[\log(1 - D_k(G(z, y); y))]$
      end
      for i = 1, G steps do
        Sample a mini-batch of noise samples z from the noise prior $p_g(z)$
        Update and fine-tune the generator as follows:
        Minimize $\mathbb{E}_{z \sim p_z, y \sim p_y}[\log(1 - D_k(G(z, y); y))]$
      end
    end

Materials
We validated the performance of our proposed Ensemble-GAN based on clinical patient data from two recent, publicly available challenge datasets in abdominal imaging: (1) the automated liver and tumor segmentation (LiTS) challenge of the MICCAI 2017 conference and (2) the segmentation challenge (CHAOS, https://chaos.grand-challenge.org/) of the ISBI 2019 conference. Both datasets consist of abdominal CT and MR images for which each image slice has been manually segmented by expert radiologists.

CHAOS
The CHAOS challenge [16] is a Combined (CT-MR) Healthy Abdominal Organ Segmentation problem that has been organized into different segmentation tasks. In this study, we evaluated our model on the segmentation of abdominal organs (CT + MRI as a task (4)). The dataset included 20 MR and 20 CT abdominal images with five segmentation labels for the liver, spleen, left kidney, right kidney, and background. We trained our model on a total of 16,266 2D images

LiTS
In the second experiment, we employed the LiTS-2017 dataset, which contains 130 training and 70 test CT cases from patients who suffered from different types of liver cancer. The challenge was to perform a simultaneous semantic segmentation of the large liver, with a 1:400 imbalanced class ratio between pixels representing the liver and the surrounding tissue, and of an abnormal target region, with a 1:1400 imbalanced class ratio between pixels representing abnormal and normal tissue.
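A ratio of the form 1:n, as quoted above, can be computed directly from a label map. The sketch below is illustrative only: the helper `imbalance_ratio`, the label coding (0 = background, 1 = liver, 2 = lesion), and the toy 20×20 map are assumptions, not the dataset's actual encoding or statistics.

```python
import numpy as np


def imbalance_ratio(label_map, target_class):
    """Number of non-target pixels per target pixel (the n in a 1:n ratio)."""
    labels = np.asarray(label_map)
    target = np.count_nonzero(labels == target_class)
    if target == 0:
        raise ValueError("target class is absent from the label map")
    return (labels.size - target) / target


# Toy label map: 0 = background, 1 = liver, 2 = lesion (illustrative values).
labels = np.zeros((20, 20), dtype=int)
labels[5:10, 5:10] = 1   # 25 liver pixels
labels[6, 6] = 2         # one of them is relabeled as lesion

print("lesion ratio 1:%.0f" % imbalance_ratio(labels, 2))
print("liver ratio  1:%.1f" % imbalance_ratio(labels, 1))
```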

Experiments
We evaluated three architectural choices for the proposed Ensemble-GAN. The first experiment, Ensemble-GAN (1), included a single generator and two discriminators. As shown in Fig. 1, the generator had a stacked hourglass network design [17], which provides a mechanism for repeated bottom-up and top-down inference, allowing for a reevaluation of the initial estimates and features across the whole image. The architecture of the discriminator was similar to a Markovian discriminator [18] to restrict the attention to the structure in local image patches. The discriminator losses were $L_{mae}$ and $L_{Dice}$. For the hyperparameters, we set $\lambda_1 = 10$ and $\lambda_2 = 5$ for $D_1$ and $D_2$, respectively. We used a network pretrained with ImageNet for the initialization of the weights of the discriminators, but we trained the generator from scratch, with weights initialized from a Gaussian distribution with a standard deviation of 0.001. The learning rate started from 0.0002 with a mini-batch size of 1. We used Adam [19] as the optimizer and set $\beta_1 = 0.9$ and $\beta_2 = 0.999$ with a weight decay of 0.0001. We used the binary cross-entropy as the adversarial loss in all experiments. The second experiment, Ensemble-GAN (2), included a single generator and three discriminators. The generator and discriminator networks had the same architecture as those of Ensemble-GAN (1). Here, we explored the effect of three discriminator losses on the outcome of the generator. We added a categorical cross-entropy loss $L_{cce}$ as a third loss with $\lambda_3 = 25$. In this architecture, the categorical cross-entropy measures the difference between the high-resolution feature map produced by the last layer of the first auto-encoder network and the ground-truth images. In the third experiment, Ensemble-GAN (3), three different outputs of a single generator were passed to three different discriminator losses. The generator and discriminator networks had the same architecture as those of Ensemble-GAN (1).
We passed the outputs of the second bottleneck and of the last fully convolutional layer of each auto-encoder of the generator separately as deep feature tensors and fed them, together with the label map, into three different discriminators. The feature vector of the bottleneck represents the local information of images, whereas the last fully convolutional layer contains global features. The combination of additional losses and the adversarial loss is controlled by a λ hyperparameter, which sets the relative importance of each loss. Here, we used two categorical cross-entropy losses for the two different generator outputs: $\lambda_1 = 100$, $\lambda_2 = 25$, and $\lambda_3 = 1$ for the high-resolution $L_{cce}$, the low-resolution $L_{cce}$, and the adversarial loss $L_{adv}$, respectively. Table 5 reports the effect of λ on semantic segmentation in terms of the F1 score. Figure 2 shows the training losses at the beginning and after 100 epochs.
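The λ-weighted combination described above can be sketched as a weighted sum of loss terms. This is a minimal illustration under assumptions: the helper names `categorical_cross_entropy` and `weighted_total`, the dictionary keys, and the loss values are hypothetical; only the weights 100, 25, and 1 come from the text.

```python
import numpy as np


def categorical_cross_entropy(probs, onehot, eps=1e-12):
    """Mean pixel-wise categorical cross-entropy; the last axis holds the classes."""
    probs = np.clip(np.asarray(probs, dtype=float), eps, 1.0)
    return float(-np.mean(np.sum(onehot * np.log(probs), axis=-1)))


def weighted_total(losses, weights):
    """Weighted sum of named loss terms: sum_i lambda_i * loss_i."""
    return sum(weights[name] * value for name, value in losses.items())


# A uniform prediction over 4 classes on a 2x2 image gives a loss of ln(4).
onehot = np.eye(4)[np.array([[0, 1], [2, 3]])]
uniform = np.full((2, 2, 4), 0.25)
cce_uniform = categorical_cross_entropy(uniform, onehot)
print("cce of uniform prediction:", cce_uniform)

# Weights quoted in the text for Ensemble-GAN (3); the loss values are made up.
weights = {"cce_high": 100.0, "cce_low": 25.0, "adv": 1.0}
losses = {"cce_high": 0.02, "cce_low": 0.04, "adv": 0.7}
total = weighted_total(losses, weights)
print("total generator loss:", total)
```

With these weights, a small change in the high-resolution cross-entropy moves the total far more than the same change in the adversarial term, which is the sense in which λ sets the relative importance of each loss.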
We implemented the Ensemble-GAN on top of Macro-Micro GANs [15]. We used all 2D slices from the axial view with size 256×256 for the CHAOS dataset and 512×512 for the LiTS dataset. For data augmentation, we applied random cropping, mirroring, scaling, enhancement, and [−10, +10] degree random rotation in all the experiments. The networks were trained on a workstation equipped with five Nvidia Titan X GPUs.
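For segmentation, each geometric augmentation has to be applied identically to the image and to its label map. The following sketch shows this for mirroring and random cropping only; rotation, scaling, and enhancement are omitted. The function `augment_pair`, the crop size of 224, and the random stand-in data are assumptions for illustration.

```python
import numpy as np


def augment_pair(image, mask, crop_size, rng):
    """Apply the same random mirror and random crop to an image and its label mask."""
    if rng.random() < 0.5:  # horizontal mirroring
        image, mask = image[:, ::-1], mask[:, ::-1]
    height, width = image.shape[:2]
    top = rng.integers(0, height - crop_size + 1)   # random crop origin
    left = rng.integers(0, width - crop_size + 1)
    window = (slice(top, top + crop_size), slice(left, left + crop_size))
    return image[window], mask[window]


rng = np.random.default_rng(0)
image = rng.random((256, 256))             # stand-in for a 256x256 axial slice
mask = (image > 0.5).astype(np.uint8)      # stand-in label map aligned with the image
aug_image, aug_mask = augment_pair(image, mask, crop_size=224, rng=rng)
print(aug_image.shape, aug_mask.shape)
```

Because the stand-in mask is a pixel-wise function of the image, the alignment after augmentation can be checked by recomputing the mask from the augmented image.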

Evaluation criteria
The evaluation and comparison of the Ensemble-GAN were performed using the quality metrics introduced by each challenge organizer. We also evaluated the performance of the proposed method with the F1 score and precision-recall, which are measures suited to imbalanced data.
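As a small worked example of why precision, recall, and the F1 score are informative on imbalanced data, the sketch below computes them for a binary mask. The function name `precision_recall_f1` and the toy vectors are assumptions for illustration.

```python
import numpy as np


def precision_recall_f1(pred, truth):
    """Precision, recall, and F1 score of a binary prediction."""
    pred = np.asarray(pred, dtype=bool)
    truth = np.asarray(truth, dtype=bool)
    tp = np.sum(pred & truth)
    fp = np.sum(pred & ~truth)
    fn = np.sum(~pred & truth)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return float(precision), float(recall), float(f1)


# 5 positive pixels out of 100; the prediction finds 4 and adds 1 false positive.
truth = np.zeros(100, dtype=bool)
truth[:5] = True
pred = np.zeros(100, dtype=bool)
pred[:4] = True
pred[50] = True

precision, recall, f1 = precision_recall_f1(pred, truth)
accuracy = float(np.mean(pred == truth))
print(precision, recall, f1, accuracy)
```

Here the accuracy is 0.98 because the many true negatives dominate, while precision, recall, and F1 are each about 0.8 because they ignore the true negatives.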
For the LiTS competition, the primary metric was the Dice score. The volume overlap error (VOE), relative volume difference (RVD), average symmetric surface distance (ASSD), and maximum symmetric surface distance (MSSD) were also considered for the evaluation of the predicted regions of the liver and the liver lesions. Tables 1 and 2 describe the quantitative results and comparisons with top-ranked methods from the LiTS leaderboard. The four metrics determined by the CHAOS organizer for evaluating the multi-organ segmentation, namely the Dice coefficient, average symmetric surface distance (ASSD), relative volume difference (RVD), and maximum symmetric surface distance (MSSD), were utilized to determine the potential over- and under-segmentation boundaries.
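The overlap-based metrics can be sketched as follows, using commonly used definitions: Dice = 2|A∩B| / (|A| + |B|), VOE = 1 − |A∩B| / |A∪B|, and RVD = (|A| − |B|) / |B|, where A is the prediction and B the reference. Whether the challenge evaluation scripts use exactly these conventions (for example the sign of RVD) is an assumption here, and the surface distances ASSD and MSSD are not covered. The function name `overlap_metrics` is hypothetical, and both masks are assumed to be non-empty.

```python
import numpy as np


def overlap_metrics(pred, ref):
    """Dice, volume overlap error, and relative volume difference of two binary masks."""
    pred = np.asarray(pred, dtype=bool)
    ref = np.asarray(ref, dtype=bool)
    inter = np.sum(pred & ref)
    union = np.sum(pred | ref)
    dice = 2.0 * inter / (pred.sum() + ref.sum())
    voe = 1.0 - inter / union
    rvd = (pred.sum() - ref.sum()) / ref.sum()
    return float(dice), float(voe), float(rvd)


ref = np.zeros((10, 10), dtype=bool)
ref[2:6, 2:6] = True     # 16 reference pixels
pred = np.zeros((10, 10), dtype=bool)
pred[2:6, 2:7] = True    # 20 predicted pixels: over-segmentation by one column

dice, voe, rvd = overlap_metrics(pred, ref)
print(dice, voe, rvd)
```

Under this sign convention a positive RVD indicates over-segmentation and a negative RVD indicates under-segmentation.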

Results
To understand the performance gains, we analyzed the accuracy on the imbalanced liver tumor segmentation dataset, in which the labels are unbalanced between the large organs and the very small lesions. Based on the leaderboard, most of the top-ranked models used cascade networks to segment the liver and the lesions simultaneously [20] or separately [21,22]. The generative ensemble networks provided a good solution to the imbalanced labeling. Table 1 shows the Dice scores for the liver and lesion segmentation. The highest scores obtained by our proposed framework were 0.94 for the liver and 0.84 for the lesions. Based on a comparison of the first two rows of Table 1, we find that the ensemble of discriminators improved the final result by up to 9% for the liver segmentation and by up to 3% for the segmentation of lesions.
For the LiTS dataset, lesions with an approximate diameter equal to or larger than 10 mm were defined as large, while lesions with a diameter of less than 10 mm were defined as small. Our method achieved an average Dice score of 0.91    Fig. 3. We used a fivefold cross-validation for the training due to the different intensity distributions of the cases. In the test phase, the voted average of these models was used for making a prediction for each case in the test dataset.
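One plausible reading of the voted average over the cross-validation models is to average their class-probability maps and take the most probable class per pixel. The sketch below illustrates that reading only; the function `average_vote` and the probability values are hypothetical, and the exact voting rule used in the experiments is not specified in the text.

```python
import numpy as np


def average_vote(prob_maps):
    """Average the class-probability maps of several models and take the arg-max label."""
    mean_probs = np.mean(np.stack(prob_maps, axis=0), axis=0)
    return mean_probs.argmax(axis=-1)


# Five hypothetical fold models, each giving probabilities for 3 pixels and 2 classes.
fold_outputs = [
    np.array([[0.9, 0.1], [0.4, 0.6], [0.2, 0.8]]),
    np.array([[0.8, 0.2], [0.6, 0.4], [0.3, 0.7]]),
    np.array([[0.7, 0.3], [0.3, 0.7], [0.1, 0.9]]),
    np.array([[0.9, 0.1], [0.7, 0.3], [0.4, 0.6]]),
    np.array([[0.6, 0.4], [0.2, 0.8], [0.2, 0.8]]),
]
labels = average_vote(fold_outputs)
print(labels)
```

For the second pixel two of the five models prefer class 0, but the averaged probability still favors class 1, which shows how averaging smooths out disagreement between folds.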
The top three rows of Table 3 show the quantitative results achieved by the different Ensemble-GAN architectures. According to Table 3 and Fig. 4, the semantic segmentation predicted by Ensemble-GAN (3) outperformed that of the other architectures, which demonstrates the success of passing two global outputs and one local feature output of the generator into three individual pretrained discriminators. The local features include more details on edges, while the global features contain more localized features. Having two adversarial losses for the global and local discriminators in addition to the binary cross-entropy of the generative model leads to better recognition and smoother segmentation boundaries in both benchmarks than the other approaches. As reported by CHAOS [13], the three top-ranked teams in task 4 used different deep ensemble discriminator networks such as cascade architectures and therefore reported a more stable result in the test phase. The Cohen's kappa scores achieved by Ensemble-GAN (1-3) on the CHAOS dataset are 0.77, 0.78, and 0.8, respectively, whereas the conditional GAN achieved an average kappa score of 0.64. The F1 scores obtained across 100 epochs on both datasets are shown in the table. Figure 5 and Table 4 present and compare the qualitative and quantitative results achieved by the different Ensemble-GAN settings and configurations.
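Cohen's kappa corrects the observed agreement $p_o$ for the agreement $p_e$ expected by chance, $\kappa = (p_o - p_e) / (1 - p_e)$. The following is a minimal sketch of that computation from a confusion matrix; the function name `cohens_kappa` and the toy labels are assumptions, and the sketch does not handle the degenerate case $p_e = 1$.

```python
import numpy as np


def cohens_kappa(pred, truth, num_classes):
    """Cohen's kappa between predicted and reference labels."""
    pred = np.asarray(pred).ravel()
    truth = np.asarray(truth).ravel()
    confusion = np.zeros((num_classes, num_classes), dtype=float)
    np.add.at(confusion, (truth, pred), 1.0)  # rows: reference, columns: prediction
    total = confusion.sum()
    observed = np.trace(confusion) / total
    expected = np.sum(confusion.sum(axis=0) * confusion.sum(axis=1)) / total**2
    return float((observed - expected) / (1.0 - expected))


truth = np.array([0, 0, 0, 0, 0, 0, 1, 1, 1, 1])
pred = np.array([0, 0, 0, 0, 0, 1, 1, 1, 1, 0])
kappa = cohens_kappa(pred, truth, num_classes=2)
print(kappa)
```

In this toy case the observed agreement is 0.8 and the chance agreement is 0.52, so kappa is 0.28 / 0.48, or about 0.58.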
The results (Table 5) showed that choosing a larger λ can generate more accurate semantic segmentation images. The adversarial loss influences whether the generator model can output images that are acceptable in the target domain. Therefore, the combination of other losses and the adversarial loss regularizes the generator model to output images that are an acceptable translation of the source image. We controlled the impact of the additional losses by a λ hyperparameter, where setting λ to 10 gives the $L_{mae}$ loss ten times the weight of the adversarial loss during training and testing. To explore the effect of the hyperparameter λ in the task of medical image semantic segmentation, we performed several experiments, which are shown in Table 5 in terms of the F1 score.