1 Introduction

Convolutional neural networks (CNNs) have been used for semantic segmentation of medical images with great success [2]. For the most part, these methods rely on fully annotated images to train the network. Although CNN-based segmentation algorithms keep evolving and improving, the amount of available training data still has a substantial effect on performance [9]. However, large-scale fully annotated data are difficult to obtain for medical images, since annotation requires an expert to spend considerable time and effort.

To address this limitation, a number of works have proposed interactive image segmentation methods relying on weak annotations such as bounding boxes [12] or scribbles [4, 6]. However, in these works, the annotations need to be provided for each new test image. Recently, a number of works have demonstrated that it is feasible to train fully automatic, learning-based algorithms using exclusively weak labels [5, 9, 10, 11]. Despite being trained on weak labels, these methods can produce full segmentation masks on test images. Of the above works, only [11] was demonstrated on medical images. The authors proposed to train a segmentation network for fetal structures from bounding box annotations only.

In this paper we present a scribble-based weakly-supervised learning framework for medical images. Scribbles have been recognized as a particularly user-friendly form of supervision [9] and may be better suited for nested structures than bounding boxes. Furthermore, they require only a fraction of the annotation time compared to full pixel-wise annotations. Following previous works, the proposed framework is an iterative two-step procedure in which a segmentation network is trained on the scribble annotations, then this network is used in conjunction with a conditional random field (CRF) to relabel the training set. This relabeled set is in turn used for an additional training recursion. We show that this procedure, under some assumptions, can be interpreted as expectation maximization (EM). We investigate multiple strategies for relabeling the training dataset, estimating the CRF parameters, and quantifying uncertainty in the relabeling step. An overview of the method is shown in Fig. 1.

We evaluate the framework and its individual components on the public cardiac ACDC dataset [2] and the NCI-ISBI 2013 prostate segmentation challenge data [3]. We show that, despite the inherently very sparse nature of the annotations, the proposed methods achieve a segmentation accuracy within 95% of a baseline network trained with full supervision. To our knowledge, this is the first demonstration of training a pixel-wise segmentation network with scribble supervision on medical image data.

Fig. 1. Overview of the proposed training framework.

2 Methods

The aim of our proposed method is to learn the parameters \(\theta \) of a CNN-based segmentation network \(\mathbf {y} = f(\mathbf {x}; \theta )\) such that it predicts a generally unknown segmentation mask \(\mathbf {y} \in \{0,\ldots ,L\}^N\) for an input image \(\mathbf {x}\in \mathbb {R}^N\), where N is the number of pixels. During training, rather than full pixel-wise annotations, we are only provided with a ground truth annotation \(\mathbf {\xi }\) for a small number of pixels (i.e. the scribbles). Note that this also includes a background scribble (see examples in Fig. 2). The proposed framework consists of a repeated estimation of the network parameters and subsequent relabeling of the training dataset by combining the network prediction with a CRF. We investigate two different CRF inference strategies: the dense CRF approach proposed in [8], and a recent extension thereof in which the CRF is formulated as a recurrent neural network (RNN) and the CRF parameters can be learned end-to-end [13]. Moreover, we investigate a novel strategy for incorporating prediction uncertainty in the relabeling step based on [7]. For all investigated strategies we perform an initial region growing step described in the following.

Fig. 2. Example images and scribbles on the left and ground truth segmentations on the right for the (a) prostate and (b) cardiac datasets, respectively.

2.1 Generation of Seed Areas by Region Growing

For this step we use the random walk-based segmentation method proposed by [6], which (similar to neural networks) produces a pixel-wise probability map for each label. We assign each pixel its predicted label only if the probability exceeds a threshold \(\tau \). Otherwise the pixel label is treated as unknown. An example of this step can be seen in Fig. 1. Note that the threshold is intentionally chosen very high so as to underestimate the true extent of the structures and only include pixels which have a very high probability of being correctly estimated. Those assignments will serve as new “ground truth” labels \(\mathbf {\hat{z}}\) for the remainder of the steps and will be referred to as seed areas. The uncertain pixels \(\mathbf {z}\) are treated as unlabeled, i.e. they are the latent variables of our model.
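The thresholding step can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions: the random walker of [6] is not reimplemented, `prob_map` stands in for its output, and the function name `seed_areas` and the marker `UNKNOWN = -1` are hypothetical choices made for this example.

```python
import numpy as np

UNKNOWN = -1  # hypothetical marker for pixels that stay unlabeled


def seed_areas(prob_map, tau):
    """Keep the most probable label only where its probability exceeds tau.

    prob_map: (H, W, K) per-pixel label probabilities, assumed to come from
              the region growing step (not reimplemented here).
    """
    best_label = prob_map.argmax(axis=-1)
    best_prob = prob_map.max(axis=-1)
    return np.where(best_prob > tau, best_label, UNKNOWN)


# Toy 2x2 image with two labels.
probs = np.array([[[0.995, 0.005], [0.600, 0.400]],
                  [[0.020, 0.980], [0.001, 0.999]]])
seeds = seed_areas(probs, tau=0.99)
```

With \(\tau = 0.99\), only the two pixels whose winning probability exceeds the threshold receive a seed label; the pixel with probability 0.98 remains unknown, which illustrates how a high threshold underestimates the extent of each structure.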

2.2 Separate CRF and Network Training

We propose a hard expectation maximization (EM) approximation to learn the network parameters \(\theta \) in an iterative fashion. The algorithm alternates between estimating the best parameters of the neural network given a labeling obtained using the current parameters \(\theta ^{old}\) (M step), and estimating the optimal labeling of the latent variables given an updated \(\theta \) (E step). We assume the following graphical model

$$\begin{aligned} p(\mathbf {z}, \mathbf {\hat{z}} | \mathbf {x}, \theta ) = p(\mathbf {z} | \mathbf {x}, \theta )p(\mathbf {\hat{z}}|\mathbf {z}, \mathbf {x}), \end{aligned}$$
(1)

where \(p(\mathbf {z} | \mathbf {x}, \theta )\) is modeled using a neural network \(f(\mathbf {x};\theta )\). Following the standard EM approach, we write the expectation of the complete-data log likelihood as

$$\begin{aligned} Q(\theta , \theta ^{old}) = \sum _\mathbf {z} p(\mathbf {z} | \mathbf {\hat{z}}, \mathbf {x}, \theta ^{old}) \ln p(\mathbf {z}, \mathbf {\hat{z}} | \mathbf {x}, \theta ). \end{aligned}$$
(2)

In the E step of the algorithm we estimate the mode of \( p(\mathbf {z} | \mathbf {\hat{z}}, \mathbf {x}, \theta ^{old}) \) as

$$\begin{aligned} \mathbf {z^*} = \mathop {\text{ arg } \text{ max }}_\mathbf {z} p(\mathbf {z} | \mathbf {\hat{z}}, \mathbf {x}, \theta ^{old}) = \mathop {\text{ arg } \text{ max }}_\mathbf {z} \frac{p(\mathbf {z}, \mathbf {\hat{z}} | \mathbf {x}, \theta ^{old})}{p(\mathbf {\hat{z}}|\mathbf {x})} = \mathop {\text{ arg } \text{ max }}_\mathbf {z} p(\mathbf {z} , \mathbf {\hat{z}} | \mathbf {x}, \theta ^{old}), \end{aligned}$$
(3)

using the fact that \(p(\mathbf {\hat{z}}|\mathbf {x})\) does not depend on \(\mathbf {z}\).

By assuming a complete dependency graph between all \(\mathbf {z},\mathbf {\hat{z}}\), the conditional joint distribution can be factorized and the E step can be written as the following CRF optimization problem:

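$$\begin{aligned} \mathbf {z^*} = \mathop {\text{ arg } \text{ min }}_\mathbf {z} \Bigg ( \sum _{i \in \mathcal {C}_u(\mathbf {z})} \psi _u(z_i | \mathbf {x}, \theta ^{old}) + \sum _{i \in \mathcal {C}_u(\mathbf {\hat{z}})} \hat{\psi }_u(\hat{z}_i) + \sum _{(i,j) \in \mathcal {C}_p(\mathbf {z}, \mathbf {\hat{z}})} \psi _p(z_i, z_j | x_i, x_j) \Bigg ), \end{aligned}$$
(4)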

where \(\mathcal {C}_u(\cdot )\) denotes the set of all unary cliques of a set of variables and \(\mathcal {C}_p(\cdot )\) denotes the set of all pairwise cliques. The unary potential function \(\psi _u\) acting on the latent variables is defined using the current network output as

$$\begin{aligned} \psi _u(z_i|\mathbf {x}, \theta ^{old}) = -\ln p(z_i | \mathbf {x}, \theta ^{old}) = -\ln f(\mathbf {x};\theta ^{old}). \end{aligned}$$
(5)

The unary potential function \(\hat{\psi }_u\) acting on the seed regions \(\mathbf {\hat{z}}\) is defined as 0 for labellings matching the ground truth and infinity otherwise, effectively preventing the initially grown regions from changing. Furthermore, we use the pairwise potential function \(\psi _p\) proposed in [8]:

$$\begin{aligned} \begin{aligned} \psi _p(z_i, z_j | x_i, x_j) = \mu (z_i, z_j)&\left( w_1\exp \left( -\frac{dist(x_i, x_j)^2}{2\sigma _\alpha ^2} -\frac{|x_i - x_j|^2}{2\sigma _\beta ^2} \right) \right. \\&\quad +\left. w_2\exp \left( -\frac{dist(x_i, x_j)^2}{2\sigma _\gamma ^2} \right) \right) , \end{aligned} \end{aligned}$$
(6)

where the label compatibility function is given by the Potts model \(\mu (z_i, z_j) = [z_i \ne z_j]\), and \(dist(\cdot , \cdot )\) denotes the Euclidean distance between the pixel locations. We estimate the hyperparameters \(w_1, w_2, \sigma _\alpha , \sigma _\beta , \sigma _\gamma \) in a grid search on the validation set. In order to optimize Eq. 4 we use the approach in [8]. We also consider a simple modification of this procedure as a baseline in which we set the pairwise terms to zero and only use the unary terms \(\psi _u, \hat{\psi }_u\).
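As a rough illustration of Eqs. 5 and 6 and of the seed constraint, the potentials could be evaluated as in the following NumPy sketch. The function names, the `UNKNOWN = -1` marker, the flattened array layout, and the toy intensities are assumptions made for this example; the inference procedure of [8] is not reproduced here. The CRF weights are the values reported for the cardiac experiments in Sect. 3.

```python
import numpy as np

UNKNOWN = -1  # hypothetical marker for pixels without a seed label


def unary_potentials(softmax, seeds):
    """Unary energies for N pixels and K labels.

    softmax: (N, K) network output f(x; theta_old).
    seeds:   (N,) seed label per pixel, or UNKNOWN for latent pixels.
    """
    psi = -np.log(softmax + 1e-12)       # Eq. 5 on the latent pixels
    idx = np.flatnonzero(seeds != UNKNOWN)
    psi[idx, :] = np.inf                 # seed pixels: every label forbidden ...
    psi[idx, seeds[idx]] = 0.0           # ... except the seed label itself
    return psi


def pairwise_potential(z_i, z_j, pos_i, pos_j, x_i, x_j,
                       w1, w2, sigma_alpha, sigma_beta, sigma_gamma):
    """Eq. 6 for a single pixel pair with the Potts compatibility."""
    if z_i == z_j:                       # mu(z_i, z_j) = [z_i != z_j]
        return 0.0
    diff = np.asarray(pos_i, dtype=float) - np.asarray(pos_j, dtype=float)
    d2 = float(np.sum(diff ** 2))        # squared Euclidean pixel distance
    appearance = w1 * np.exp(-d2 / (2 * sigma_alpha ** 2)
                             - (x_i - x_j) ** 2 / (2 * sigma_beta ** 2))
    smoothness = w2 * np.exp(-d2 / (2 * sigma_gamma ** 2))
    return float(appearance + smoothness)


softmax = np.array([[0.7, 0.3], [0.2, 0.8]])
seeds = np.array([UNKNOWN, 0])
psi = unary_potentials(softmax, seeds)
baseline_labels = psi.argmin(axis=1)     # unary-only baseline relabeling

crf = dict(w1=5, w2=10, sigma_alpha=2, sigma_beta=0.1, sigma_gamma=5)
similar = pairwise_potential(0, 1, (0, 0), (0, 1), 0.50, 0.50, **crf)
different = pairwise_potential(0, 1, (0, 0), (0, 1), 0.50, 1.00, **crf)
```

In this toy case the second pixel keeps its seed label 0 even though the network favors label 1, and assigning different labels to two neighboring pixels costs less when their intensities differ (`different`) than when they are identical (`similar`), which is the edge-preserving behavior the appearance kernel is meant to provide.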

In the M step, after we have found the optimal labeling of the latent variables \(\mathbf {z^*}\) using the network parameters \(\theta ^{old}\) we can rewrite Eq. 2 as

$$\begin{aligned} \begin{aligned} Q(\theta , \theta ^{old})&\approx \sum _{\mathbf {z}}\delta (\mathbf {z}=\,\mathbf {z^*}|\mathbf {\hat{z}},\mathbf {x},\theta ^{old}) \ln p(\mathbf {z},\mathbf {\hat{z}}| \mathbf {x}, \theta ) \\&= \ln p(\mathbf {z^*} | \mathbf {x}, \theta ) + \ln p(\mathbf {\hat{z}}|\mathbf {z^*}, \mathbf {x}), \end{aligned} \end{aligned}$$
(7)

where \(\delta \) is the Dirac delta function, the approximate equality is due to the hard EM approximation, and we substituted Eq. 1 to obtain the final equality. Since \(\ln p(\mathbf {\hat{z}}|\mathbf {z^*}, \mathbf {x})\) does not depend on \(\theta \), the optimization can be written as

$$\begin{aligned} \theta ^* = \mathop {\text{ arg } \text{ max }}_\theta \ln p(\mathbf {z^*} | \mathbf {x}, \theta ). \end{aligned}$$
(8)

We find the parameters \(\theta \) that maximize the likelihood of predicting the labels \(\mathbf {z^*}\) by minimizing the pixel-wise cross entropy between the labels \(\mathbf {z^*}\) and the network output. We use the ADAM optimizer with an initial learning rate of 0.001, which is multiplied by 0.9 every 3000 iterations. We use the modified U-Net segmentation network used in [1] in all experiments. The network parameters \(\theta \) for each recursion are initialized with \(\theta ^{old}\). The E and M steps are repeated until convergence, which typically occurs within 3 recursions.
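The learning-rate schedule can be written as a small helper. The text does not state whether the decay is applied in discrete steps or continuously, so the staircase form below is an assumption, as is the function name `learning_rate`.

```python
def learning_rate(iteration, base_lr=0.001, decay=0.9, step=3000):
    """Staircase decay: multiply the rate by `decay` once every `step` iterations."""
    return base_lr * decay ** (iteration // step)


# Rate at the start, just before and at the first decay, and after two decays.
rates = [learning_rate(i) for i in (0, 2999, 3000, 6000)]
```

Under this reading the rate stays at 0.001 for the first 3000 iterations, drops to 0.0009 at iteration 3000, and reaches 0.00081 at iteration 6000.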

In the first recursion, we set the cross-entropy loss to zero in all locations where the random walk is “uncertain” (probabilities below \(\tau \)), allowing the network to predict any label in those regions. We also explore a strategy to identify uncertain regions in subsequent recursions, which will be discussed in Sect. 2.4.
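Setting the loss to zero at uncertain locations amounts to a masked cross entropy. The NumPy sketch below illustrates the idea; the function name, the `unknown=-1` marker, and the choice to average over the labeled pixels only are assumptions, since the normalization is not specified in the text.

```python
import numpy as np


def masked_cross_entropy(softmax, labels, unknown=-1):
    """Cross entropy over labeled pixels; unknown pixels contribute nothing.

    softmax: (N, K) predicted label probabilities.
    labels:  (N,) target label per pixel, or `unknown` where uncertain.
    """
    known = np.flatnonzero(labels != unknown)
    if known.size == 0:
        return 0.0
    picked = softmax[known, labels[known]]   # probability of the target label
    return float(-np.mean(np.log(picked + 1e-12)))


labels = np.array([0, -1, 1])
pred_a = np.array([[0.9, 0.1], [0.5, 0.5], [0.2, 0.8]])
pred_b = np.array([[0.9, 0.1], [0.01, 0.99], [0.2, 0.8]])  # differs only at the unknown pixel
loss_a = masked_cross_entropy(pred_a, labels)
loss_b = masked_cross_entropy(pred_b, labels)
```

Both predictions yield the same loss because they differ only at the unlabeled pixel, so the network is not penalized for any label it predicts there.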

2.3 Integrated Network and CRF Training (CRF-RNN)

Here, we investigate estimation of the CRF parameters as part of the network training. To that end we use the CRF-RNN layer proposed in [13] which learns individual kernel weights for each class and a more flexible compatibility matrix.

To obtain a new labeling \(\mathbf {z^*}\) we simply run a forward pass through the network. Next, in order to prevent the original seed regions \(\mathbf {\hat{z}}\) from changing, we simply reset those values to their original label. In future work, we aim to include this constraint directly into the CRF-RNN formulation.
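Resetting the seed regions after the forward pass is a simple element-wise selection. The following sketch assumes that seeds and predictions are stored as integer label maps and that unlabeled seed pixels carry a hypothetical marker `-1`; the network forward pass itself is not shown.

```python
import numpy as np


def relabel_with_seeds(prediction, seeds, unknown=-1):
    """Use the network prediction everywhere except on the original seed pixels."""
    return np.where(seeds != unknown, seeds, prediction)


prediction = np.array([[1, 1], [0, 2]])   # stand-in for the CRF-RNN output labels
seeds = np.array([[0, -1], [-1, 2]])      # seed areas from the region growing step
new_labels = relabel_with_seeds(prediction, seeds)
```

Here the top-left pixel is restored to its seed label 0 although the network predicted 1, while the two latent pixels take the predicted labels.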

In the subsequent network optimization step, we directly learn to predict those \(\mathbf {z^*}\). Here we use the following training scheme: the network parameters are trained as above for 10 mini-batch iterations while keeping the RNN parameters constant. Every 10 iterations, the RNN parameters are updated with a learning rate of \(10^{-7}\), while freezing the remainder of the network parameters. As before, the label estimation and training steps are repeated until convergence.

2.4 Quantifying Segmentation Uncertainty

In order to prevent segmentation errors from early recursions from propagating, we investigate the following strategy to reset labels predicted with insufficient certainty after each E step. We add dropout with probability 0.5 to the 5 innermost blocks of our U-Net architecture during training. In order to estimate the new optimal labeling \(\mathbf {z^*}\), we perform 50 forward passes with dropout, similar to [7]. Rather than a single output, this yields a distribution of logits and softmax outputs for each pixel and label. For each pixel, we then compare the logit distributions of the labels with the highest and second highest softmax mean using Welch’s t-test. If the null hypothesis that both logit distributions have the same mean cannot be rejected (\(p\ge 0.05\)), we conclude that the label was not predicted with sufficient certainty and reset its labeling to “uncertain”. Thus, in the subsequent M step the network will be free to predict any label in that location. Otherwise, we set the pixel to the label with the highest probability.
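The test can be sketched with SciPy's `ttest_ind` using `equal_var=False`, which performs Welch's t-test. The array layout `(T, N, K)` for `T` stochastic passes, the function name, and the `unknown=-1` marker are assumptions for this example, and the synthetic logits below merely stand in for the outputs of dropout forward passes.

```python
import numpy as np
from scipy import stats


def uncertainty_relabel(logits, alpha=0.05, unknown=-1):
    """Reset pixels whose two leading labels are not significantly different.

    logits: (T, N, K) logits from T forward passes with dropout.
    """
    shifted = logits - logits.max(axis=-1, keepdims=True)
    softmax = np.exp(shifted)
    softmax /= softmax.sum(axis=-1, keepdims=True)

    # Labels with the highest and second highest mean softmax per pixel.
    order = np.argsort(softmax.mean(axis=0), axis=-1)
    top, second = order[:, -1], order[:, -2]

    pix = np.arange(logits.shape[1])
    top_logits = logits[:, pix, top]          # (T, N)
    second_logits = logits[:, pix, second]    # (T, N)

    # Welch's t-test per pixel (unequal variances).
    _, p = stats.ttest_ind(top_logits, second_logits, axis=0, equal_var=False)
    return np.where(p < alpha, top, unknown)


T = 50
d = np.linspace(-1.0, 1.0, T)                 # deterministic stand-in for dropout noise
confident = np.stack([5 + d, 0 + d[::-1], -5 + d], axis=-1)   # label 0 clearly wins
ambiguous = np.stack([1 + d, 1 + d[::-1], -5 + d], axis=-1)   # labels 0 and 1 tie
logits = np.stack([confident, ambiguous], axis=1)             # (T, 2, 3)
labels = uncertainty_relabel(logits)
```

In this toy example the first pixel keeps label 0, whereas the second pixel is reset to unknown because the logits of its two leading labels have the same mean.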

3 Experiments and Results

We trained and evaluated the methods on two publicly available datasets: the ACDC cardiac segmentation challenge data [2] for which the Myocardium (Myo), the left and right ventricles (LV and RV) have been annotated, and the NCI-ISBI 2013 prostate segmentation challenge data [3] for which reference annotations of the central gland (CG) and the peripheral zone (PZ) were available. For the cardiac data we split the data into 160 training volumes and 40 validation volumes, and evaluated the algorithms on 100 images using the challenge server. For the prostate data we split 29 available training volumes into 12 training, 7 validation and 10 testing volumes. Training was performed on 2D slices.

We used \(\tau =0.99\) for the cardiac and \(\tau =0.90\) for the prostate experiments. For the separate CRF we used \(w_1=5, w_2=10, \sigma _\alpha =2, \sigma _\beta =0.1, \sigma _\gamma =5\) for the cardiac experiments and \(w_1=6, w_2=10, \sigma _\alpha =3, \sigma _\beta =0.01, \sigma _\gamma =2\) for the prostate experiments. For the CRF-RNN we used \(\sigma _\alpha =160\) for the cardiac data, \(\sigma _\alpha =250\) for the prostate data, and \(\sigma _\beta =3, \sigma _\gamma =10\) for both datasets.

In the following experiments, the simple recursive training strategy which does not make use of pairwise terms in Eq. 4, nor uncertainty estimation, is called base. We evaluated the performance with and without the components discussed above. Additionally, we also investigated the same segmentation architecture on the fully labeled data to obtain an upper bound on the performance, and a version of base in which we did not perform any recursions, but used the network parameters learned directly on the seed regions \(\hat{\mathbf {z}}\).

The Dice scores with respect to the reference annotations for all the examined methods and structures are shown in Table 1. Note that the ACDC challenge server did not allow for higher precision Dice reporting in the post-challenge phase. Example segmentations for the two best performing methods are shown in Fig. 3 for the cardiac and prostate data, respectively.
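For reference, the Dice score of a single structure can be computed as in the sketch below. This is a generic illustration rather than the evaluation code of the challenge server; the function name and the convention of returning 1.0 when a structure is absent from both masks are assumptions.

```python
import numpy as np


def dice_score(pred, ref, label):
    """Dice overlap 2|P ∩ R| / (|P| + |R|) for one label."""
    p = pred == label
    r = ref == label
    denom = p.sum() + r.sum()
    if denom == 0:
        return 1.0   # assumed convention: structure absent in both masks
    return float(2.0 * np.logical_and(p, r).sum() / denom)


pred = np.array([[1, 1, 0], [0, 2, 2]])
ref = np.array([[1, 0, 0], [0, 2, 2]])
dice_1 = dice_score(pred, ref, label=1)
dice_2 = dice_score(pred, ref, label=2)
```

For label 1 the prediction covers two pixels, the reference one, and they share one pixel, giving a Dice score of 2/3; label 2 matches exactly and scores 1.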

Table 1. Dice scores on Cardiac and Prostate datasets.

Fig. 3. Randomly sampled example segmentations for the two best performing training strategies for the (a) cardiac and (b) prostate data.

We observe that (a) the recursive training regime led to substantial improvements over non-recursive training, (b) the dropout based uncertainty was responsible for the largest improvements, (c) additional CRF led to further, albeit smaller improvements, (d) using CRF-RNN without uncertainty led to similar results as the separate CRF with uncertainty, (e) applying dropout uncertainty in conjunction with the CRF-RNN did not lead to additional improvements and performed slightly worse on the prostate. We believe this is due to the CRF-RNN module leading to unusual logit distributions at its input. On average, the training frameworks with (1) CRF-RNN, and with (2) separate CRF and uncertainty performed the best and similar to each other. Future work on integrating uncertainty with the CRF-RNN may lead to further improvements.

Most importantly, the results show that our proposed training strategy makes it possible to learn a pixel-level segmentation network using scribble supervision alone with a remarkably small degradation compared to the fully supervised upper bound. For instance, the performance of the CRF-RNN method is only 4.5% worse on the prostate, and 2.9% worse on the cardiac data compared to fully supervised training. These results are also confirmed by the qualitative analysis. We believe this is likely an acceptable error margin for certain quantification studies where precise border delineation is of secondary importance, such as automatic estimation of cardiac ejection fractions [2].

4 Conclusion

In this paper, we investigated training strategies to train a fully automatic segmentation network with scribble supervision alone. We demonstrated the feasibility of the techniques on two publicly available medical image datasets and showed that only a remarkably small performance degradation is incurred with respect to fully supervised upper bound networks.