1 Introduction

With the introduction of fully convolutional neural networks (CNNs), deep learning has raised the benchmark for medical image segmentation in both speed and accuracy [7]. Various 2D [5, 8] and 3D [1, 2, 6, 10] networks have been proposed to segment anatomies such as the heart, brain, liver, and prostate from medical images. Despite the promising results of these networks, 3D CNN image segmentation remains challenging. Most networks have been applied to datasets with small numbers of labels (<10), especially in 3D segmentation. When more detailed segmentation with many more anatomical structures is required, previously unseen issues, such as computational feasibility and highly unbalanced object sizes, need to be addressed by new network architectures and algorithms.

Only a few frameworks have been proposed for highly unbalanced labels. In [8], a 2D network architecture was proposed to segment all slices of a 3D brain volume, and error corrective boosting was introduced to compute label weights that emphasize parameter updates on classes with lower validation accuracy. Although the results were promising, the label weights were only applied to the weighted cross-entropy and not to the Dice loss, and stacking 2D results into a 3D segmentation may produce inconsistency among consecutive slices.

In [9], the generalized Dice loss was used as the loss function. Instead of computing the Dice loss of each label, the generalized Dice loss computes the weighted sum of the products over the weighted sum of the sums between the ground-truth and predicted probabilities, with the weights inversely proportional to the label frequencies. In fact, the Dice coefficient is unfavorable to small structures, as misclassifying a few pixels can cause a large decrease in the coefficient, and this sensitivity is irrelevant to the relative sizes among structures. Therefore, balancing by label frequencies is suboptimal for Dice losses.

To address the issues of highly unbalanced object sizes and computational efficiency in 3D segmentation, this paper makes two key contributions. (I) We propose the exponential logarithmic loss function. In [4], to handle a highly unbalanced dataset in a two-class image classification problem, a modulating factor computed solely from the softmax probability of the network output is multiplied by the weighted cross-entropy to focus on the less accurate class. Inspired by this concept of balancing classification difficulties, we propose a loss function comprising the logarithmic Dice loss, which intrinsically focuses more on less accurately segmented structures. The nonlinearities of the logarithmic Dice loss and the weighted cross-entropy can be further controlled by the proposed exponential parameters. In this manner, the network can achieve accurate segmentation on both small and large structures. (II) We propose a fast-converging and computationally efficient network architecture that combines the advantages of skip connections and deep supervision; it has only about 1/14 of the parameters of, and is twice as fast as, the V-Net [6]. Experiments were performed on brain magnetic resonance (MR) images with 20 highly unbalanced labels. Combining these two innovations achieved an average Dice coefficient of 82% with an average segmentation time of 0.4 s.

2 Methodology

2.1 Proposed Network Architecture

3D segmentation networks require much more computational resources than 2D networks. Therefore, we propose a network architecture that aims at accurate segmentation and fast convergence with limited resources (Fig. 1). Similar to most segmentation networks, ours comprises encoding and decoding paths. The network is composed of convolutional blocks, each comprising k cascading \(3 \times 3\times 3\) convolutional layers of n channels associated with batch normalization (BN) and rectified linear units (ReLU). For better convergence, a skip connection with a \(1 \times 1 \times 1\) convolutional layer is used in each block. Instead of concatenating, we add the two branches together for lower memory consumption, so the block allows efficient multi-scale processing and deeper networks can be trained. The number of channels (n) is doubled after each max pooling and halved after each upsampling. More layers (k) are used with tensors of smaller sizes so that more abstract knowledge can be learned with feasible memory use. Feature channels from the encoding path are concatenated with the corresponding tensors in the decoding path for better convergence. We also include a Gaussian noise layer and a dropout layer to avoid overfitting.
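The convolutional block described above can be sketched as follows, assuming PyTorch. The class name `ConvBlock`, the exact ordering of BN and ReLU within the cascade, and the layer hyperparameters are illustrative assumptions, not the authors' released implementation.

```python
import torch
import torch.nn as nn

class ConvBlock(nn.Module):
    """Block(n, k): k cascading 3x3x3 convolutions of n channels with BN and
    ReLU, plus a 1x1x1 convolutional skip connection merged by addition."""
    def __init__(self, in_channels, n, k):
        super().__init__()
        layers = []
        for i in range(k):
            layers += [
                nn.Conv3d(in_channels if i == 0 else n, n, 3, padding=1),
                nn.BatchNorm3d(n),
                nn.ReLU(inplace=True),
            ]
        self.cascade = nn.Sequential(*layers)
        # 1x1x1 convolution so the skip branch matches the channel count
        self.skip = nn.Conv3d(in_channels, n, 1)

    def forward(self, x):
        # Addition instead of concatenation keeps memory consumption low
        return self.cascade(x) + self.skip(x)

block = ConvBlock(in_channels=16, n=32, k=2)
out = block(torch.zeros(1, 16, 8, 8, 8))   # (batch, channels, D, H, W)
```

Merging the skip branch by addition, rather than concatenation, halves the channel count of the merged tensor, which is what makes the deeper multi-scale blocks trainable within a 12 GB memory budget.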

Similar to [5], we utilize deep supervision, which allows more direct backpropagation to the hidden layers for faster convergence and better accuracy [3]. Although deep supervision significantly improves convergence, it is memory-expensive, especially in 3D networks. Therefore, we omit the tensor from the block with the most channels (Block(192, 3)) so that training can be performed on a GPU with 12 GB of memory. A final layer of \(1 \times 1 \times 1\) convolution with the softmax function provides the segmentation probabilities.

Fig. 1. Proposed network architecture optimized for 3D segmentation. Blue and white boxes indicate operation outputs and copied data, respectively.

2.2 Exponential Logarithmic Loss

We propose a loss function which improves segmentation on small structures:

$$\begin{aligned} L_\mathrm {Exp} = w_{\mathrm {Dice}} L_\mathrm {Dice} + w_\mathrm {Cross} L_\mathrm {Cross} \end{aligned}$$
(1)

with \(w_{\mathrm {Dice}}\) and \(w_\mathrm {Cross}\) the respective weights of the exponential logarithmic Dice loss (\(L_\mathrm {Dice}\)) and the weighted exponential cross-entropy (\(L_\mathrm {Cross}\)):

$$\begin{aligned}&L_\mathrm {Dice} = \mathbf {E}\left[ \left( - \ln (\mathrm {Dice}_i) \right) ^{\gamma _\mathrm {Dice}}\right] \ \text {with Dice}_i = \frac{2 \left( \sum \nolimits _\mathbf {x} \delta _{il}(\mathbf {x})\ p_i(\mathbf {x})\right) + \epsilon }{\left( \sum \nolimits _\mathbf {x} \left( \delta _{il}(\mathbf {x}) + p_i(\mathbf {x})\right) \right) + \epsilon } \end{aligned}$$
(2)
$$\begin{aligned}&\quad \qquad L_\mathrm {Cross} = \mathbf {E}\left[ w_l \left( - \ln (p_l(\mathbf {x}))\right) ^{\gamma _\mathrm {Cross}} \right] \quad \end{aligned}$$
(3)

with \(\mathbf {x}\) the pixel position and i the label. l is the ground-truth label at \(\mathbf {x}\). \(\mathbf {E}[\bullet ]\) is the mean value with respect to i and \(\mathbf {x}\) in \(L_\mathrm {Dice}\) and \(L_\mathrm {Cross}\), respectively. \(\delta _{il}(\mathbf {x})\) is the Kronecker delta which is 1 when \(i=l\) and 0 otherwise. \(p_{i}(\mathbf {x})\) is the softmax probability which acts as the portion of pixel \(\mathbf {x}\) owned by label i when computing \(\mathrm {Dice}_i\). \(\epsilon = 1\) is the pseudocount for additive smoothing to handle missing labels in training samples. \(w_{l} = \left( (\sum \nolimits _k f_k)/{f_l}\right) ^{0.5}\), with \(f_k\) the frequency of label k, is the label weight for reducing the influences of more frequently seen labels. \(\gamma _\mathrm {Dice}\) and \(\gamma _\mathrm {Cross}\) further control the nonlinearities of the loss functions, and we use \(\gamma _\mathrm {Dice} = \gamma _\mathrm {Cross} = \gamma \) here for simplicity.
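For concreteness, (1)-(3) can be sketched in NumPy as below. The function signature and tensor shapes are illustrative assumptions; a training implementation would operate on framework tensors with automatic differentiation instead.

```python
import numpy as np

def exp_log_loss(p, l, f, w_dice=0.8, w_cross=0.2, gamma=0.3, eps=1.0):
    """p: softmax probabilities, shape (num_labels, num_voxels);
    l: ground-truth label per voxel, shape (num_voxels,);
    f: label frequencies f_k, shape (num_labels,)."""
    num_labels = p.shape[0]
    delta = np.eye(num_labels)[:, l]                 # delta_il(x) as one-hot
    # Eq. (2): smoothed Dice per label i, then exponential logarithmic Dice loss
    dice = (2.0 * (delta * p).sum(axis=1) + eps) / ((delta + p).sum(axis=1) + eps)
    loss_dice = np.mean((-np.log(dice)) ** gamma)
    # Label weights w_l = ((sum_k f_k) / f_l)^0.5
    w = (f.sum() / f) ** 0.5
    # Eq. (3): weighted exponential cross-entropy over voxels
    p_true = p[l, np.arange(l.size)]                 # p_l(x) at each voxel
    loss_cross = np.mean(w[l] * (-np.log(p_true)) ** gamma)
    # Eq. (1): weighted sum of the two losses
    return w_dice * loss_dice + w_cross * loss_cross
```

Note that the pseudocount \(\epsilon\) keeps the Dice term finite for labels absent from a training sample, since both numerator and denominator then reduce to \(\epsilon\).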

The use of the Dice loss in CNNs was proposed in [6]. The Dice coefficient is unfavorable to small structures as misclassifying a few pixels can lead to a large decrease of the coefficient. The use of label weights cannot alleviate such sensitivity as it is irrelevant to the relative object sizes, and the Dice coefficient is already a normalized metric. Therefore, instead of balancing by size differences, we use the logarithmic Dice loss, which focuses more on less accurate labels. Figure 2 shows a comparison between the linear (\(\mathbf {E}\left[ 1 - \mathrm {Dice}_i\right] \)) and logarithmic Dice loss.

We provide further control over the nonlinearities of the losses by introducing the exponents \(\gamma _\mathrm {Dice}\) and \(\gamma _\mathrm {Cross}\). In [4], a modulating factor, \((1 - p_l)^\gamma \), is multiplied by the weighted cross-entropy to give \(w_l(1 - p_l)^\gamma (-\ln (p_l))\) for two-class image classification. Apart from balancing the label frequencies through the label weights \(w_l\), this focal loss also balances between easy and hard samples. Our exponential loss achieves a similar goal. With \(\gamma > 1\), the loss focuses more on less accurate labels than the logarithmic loss (Fig. 2). Although the focal loss works well for the two-class image classification in [4], we obtained worse results when applying it to our segmentation problem with 20 labels. This may be caused by the over-suppression of the loss function when the label accuracy becomes high. In contrast, we obtained better results with \(0< \gamma < 1\). Figure 2 shows that when \(\gamma = 0.3\), there is an inflection point around \(x = 0.5\), where x can be \(\mathrm {Dice}_i\) or \(p_l(\mathbf {x})\). For \(x < 0.5\), this loss behaves similarly to the losses with \(\gamma \ge 1\), with the gradient magnitude decreasing as x increases. This trend reverses for \(x > 0.5\), with the gradient magnitude increasing. Consequently, this loss encourages improvements at both low and high prediction accuracy. This characteristic is the reason for using the proposed exponential form instead of the one in [4].
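The location of this inflection point can be checked numerically: differentiating \((-\ln x)^{\gamma }\) gives a gradient magnitude of \(\gamma (-\ln x)^{\gamma - 1}/x\), which is minimized where \(\ln x = \gamma - 1\), i.e. \(x = e^{-0.7} \approx 0.50\) for \(\gamma = 0.3\). A small NumPy sketch (the grid resolution is an arbitrary choice):

```python
import numpy as np

gamma = 0.3
x = np.linspace(0.05, 0.95, 1801)
# |d/dx (-ln x)^gamma| = gamma * (-ln x)^(gamma - 1) / x
grad_mag = gamma * (-np.log(x)) ** (gamma - 1.0) / x
x_min = x[np.argmin(grad_mag)]   # gradient magnitude is smallest near x = 0.5
```

The gradient magnitude decreases on \((0, e^{\gamma -1})\) and increases beyond it, matching the behavior described for Fig. 2.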

Fig. 2. Loss functions with different nonlinearities, where x can be \(\mathrm {Dice}_i\) or \(p_l(\mathbf {x})\).

2.3 Training Strategy

Image augmentation is used to learn invariant features and avoid overfitting. As realistic nonrigid deformation is difficult to implement and computationally expensive, we limit the augmentation to rigid transformations, including rotation (axial, \({\pm }30^{\circ }\)), shifting (±20%), and scaling ([0.8, 1.2]). Each image has an 80% chance of being transformed in training, so the number of augmented images is proportional to the number of epochs. The Adam optimizer with Nesterov momentum is used for fast convergence, with a learning rate of 10\(^{-3}\), a batch size of one, and 100 epochs. A TITAN X GPU with 12 GB of memory is used.
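The rigid augmentation above can be sketched as a single composed affine transform, assuming scipy. The sampling details (interpolation order, how the shift is composed with the rotation and scaling) are illustrative assumptions rather than the authors' exact procedure.

```python
import numpy as np
from scipy.ndimage import affine_transform

def augment(volume, rng, p_transform=0.8):
    """Randomly apply an axial rotation (+/-30 deg), shift (+/-20%), and
    scaling ([0.8, 1.2]) with probability p_transform."""
    if rng.random() >= p_transform:
        return volume
    angle = np.deg2rad(rng.uniform(-30.0, 30.0))
    scale = rng.uniform(0.8, 1.2)
    shift = rng.uniform(-0.2, 0.2, size=3) * np.array(volume.shape)
    c, s = np.cos(angle), np.sin(angle)
    # Rotation about the axial axis combined with isotropic scaling;
    # affine_transform maps output coordinates back to input coordinates
    matrix = np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]]) / scale
    center = (np.array(volume.shape) - 1) / 2.0
    offset = center - matrix @ center + shift
    return affine_transform(volume, matrix, offset=offset, order=1)
```

Composing the transforms into one resampling avoids interpolating the volume three times, which would blur small structures.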

3 Experiments

3.1 Data and Experimental Setups

A dataset of 43 3D brain MR images from different patients was neuroanatomically labeled to provide the training and validation samples. The images were produced by the T1-weighted MP-RAGE pulse sequence, which provides high tissue contrast. They were manually segmented by highly trained experts, with the results reviewed by a consulting neuroanatomist. Each segmentation had 19 semantic labels of brain structures, thus 20 labels with the background included (Table 1(a)). As the image sizes (128 to 337) and spacings (0.9 to 1.5 mm) varied, each image was resampled to isotropic spacing using the minimum spacing, zero-padded on the shorter sides, and resized to \(128 \times 128 \times 128\).
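The preprocessing described above (isotropic resampling at the minimum spacing, zero padding to a cube, resizing to \(128^3\)) can be sketched as follows, assuming scipy; the interpolation orders are assumptions, and a label volume would need nearest-neighbor interpolation instead.

```python
import numpy as np
from scipy.ndimage import zoom

def preprocess(volume, spacing, target=128):
    # Resample to isotropic voxels at the minimum spacing
    iso = min(spacing)
    volume = zoom(volume, [s / iso for s in spacing], order=1)
    # Zero-pad the shorter sides so the volume becomes a cube
    side = max(volume.shape)
    volume = np.pad(volume, [(0, side - d) for d in volume.shape])
    # Resize the cube to target^3
    return zoom(volume, target / side, order=1)
```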

Table 1(a) shows that the labels were highly unbalanced. The background occupied 93.5% of an image on average. Without the background, the relative sizes of the smallest and largest structures were 0.07% and 50.24%, respectively, thus a ratio of 0.14%.

We studied six loss functions using the proposed network, and applied the best one to the V-Net architecture [6], thus a total of seven cases were studied. For \(L_\mathrm {Exp}\), we set \(w_{\mathrm {Dice}} = 0.8\) and \(w_{\mathrm {Cross}} = 0.2\) as they provided the best results. Five sets of data were generated by shuffling and splitting the dataset, with 70% for training and 30% for validation in each set. Experiments were performed on all five sets of data for each studied case to obtain more statistically sound results. The actual Dice coefficients, not the \(\mathrm {Dice}_i\) in (2), were computed for each validation image. Identical setup and training strategy were used in all experiments.
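The actual per-label Dice coefficients (the plain overlap metric, not the smoothed \(\mathrm {Dice}_i\) of (2)) can be computed as in this sketch; the function and argument names are illustrative.

```python
import numpy as np

def dice_per_label(pred, truth, num_labels):
    """pred, truth: integer label volumes of identical shape."""
    scores = []
    for i in range(num_labels):
        p, t = (pred == i), (truth == i)
        denom = p.sum() + t.sum()
        # 2*|P & T| / (|P| + |T|); NaN when the label is absent from both
        scores.append(2.0 * (p & t).sum() / denom if denom else np.nan)
    return np.array(scores)
```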

Table 1. Semantic brain segmentation. (a) Semantic labels and their relative sizes on average (%) without the background. CVL represents cerebellar vermal lobules. The background occupied 93.5% of an image on average. (b) Dice coefficients between prediction and ground truth averaged from five experiments (format: mean ± std%). The best results are highlighted in blue. \(w_{\mathrm {Dice}} = 0.8\) and \(w_{\mathrm {Cross}} = 0.2\) for all \(L_\mathrm {Exp}\).

3.2 Results and Discussion

Table 1(b) shows the Dice coefficients averaged over the five experiments. The linear Dice loss (\(\mathbf {E}\left[ 1 - \mathrm {Dice}_i\right] \)) had the worst performance. It performed well on relatively large structures such as the gray and white matters, but its performance decreased with the sizes of the structures. The very small structures, such as the nucleus accumbens and amygdala, were missed in all experiments. In contrast, the logarithmic Dice loss (\(L_\mathrm {Dice}(\gamma = 1)\)) provided much better results, though the large standard deviation of label 2 indicates that there were misses. We also performed experiments with the weighted cross-entropy (\(L_\mathrm {Cross}(\gamma = 1)\)), whose performance was better than the linear Dice loss but worse than the logarithmic Dice loss. The weighted sum of the logarithmic Dice loss and the weighted cross-entropy (\(L_\mathrm {Exp}(\gamma = 1)\)) outperformed the individual losses and provided the second-best results among the tested cases. As discussed in Sect. 2.2, \(L_\mathrm {Exp}(\gamma = 2)\) was ineffective even on larger structures. This is consistent with our observation in Fig. 2 that the loss function is over-suppressed as the accuracy gets higher. In contrast, \(L_\mathrm {Exp}(\gamma = 0.3)\) gave the best results. Although it only performed slightly better than \(L_\mathrm {Exp}(\gamma = 1)\) in terms of the means, the smaller standard deviations indicate that it was also more precise.

Fig. 3. Validation Dice coefficients vs. epoch, averaged from five experiments.

Fig. 4. Visualization of an example. Top: axial view. Bottom: 3D view with the cerebral grey, cerebral white, and cerebellar grey matters hidden for better illustration.

When applying the best loss function to the V-Net, its performance was only better than the linear Dice loss and \(L_\mathrm {Exp}(\gamma = 2)\). This shows that our proposed network architecture performed better than the V-Net on this problem.

Figure 3 shows the validation Dice coefficients vs. epoch, averaged over the five experiments. Instead of the losses, we show the Dice coefficients as their magnitudes were consistent among cases. Consistent with Table 1(b), the logarithmic Dice loss, \(L_\mathrm {Exp}(\gamma = 1)\), and \(L_\mathrm {Exp}(\gamma = 0.3)\) had good convergence and performance, with \(L_\mathrm {Exp}(\gamma = 0.3)\) performing slightly better. These three cases converged at about 80 epochs. The weighted cross-entropy and \(L_\mathrm {Exp}(\gamma = 2)\) fluctuated more. The linear Dice loss also converged at about 80 epochs but with a much smaller Dice coefficient. Comparing the V-Net with the proposed network under \(L_\mathrm {Exp}(\gamma = 0.3)\), the V-Net had worse convergence, especially in the earlier epochs. This shows that the proposed network had better convergence.

Figure 4 shows the visualization of an example, with two obvious observations. First, consistent with Table 1(b), the linear Dice loss missed some small structures, such as the nucleus accumbens and amygdala, though it performed well on large structures. Second, the segmentation of the V-Net deviated considerably from the ground truth. The logarithmic Dice loss, \(L_\mathrm {Exp}(\gamma = 1)\), and \(L_\mathrm {Exp}(\gamma = 0.3)\) had the best segmentations and average Dice coefficients. The weighted cross-entropy had the same average Dice coefficient as \(L_\mathrm {Exp}(\gamma = 2)\), though the weighted cross-entropy over-segmented some structures such as the brainstem, and \(L_\mathrm {Exp}(\gamma = 2)\) had a noisier segmentation.

Comparing the efficiency of the proposed network and the V-Net, the proposed network had around 5 million parameters while the V-Net had around 71 million, a 14-fold difference. Furthermore, the proposed network took only about 0.4 s on average to segment a \(128 \times 128 \times 128\) volume, while the V-Net took about 0.9 s. Therefore, the proposed network was more efficient.

4 Conclusion

In this paper, we propose a network architecture optimized for 3D image segmentation and a loss function for segmenting very small structures. The proposed network architecture has only about 1/14 of the parameters of, and is twice as fast as, the V-Net. For the loss function, the logarithmic Dice loss outperforms the linear Dice loss, and the weighted sum of the logarithmic Dice loss and the weighted cross-entropy outperforms the individual losses. With the introduction of the exponential form, the nonlinearities of the loss functions can be further controlled to improve the accuracy and precision of segmentation.