1 Introduction

Diseases of the female reproductive system can cause pain and discomfort, hormonal dysfunction, infertility and even death; cancers of this system represent around 16% of all cancers diagnosed in women worldwide and affect more than 1.85 million women annually [10]. Due to the difficulty of distinguishing between benign and malignant tumors and to high interobserver variability, malignant ovarian tumors have a 68% fatality rate in the European Union [10]. A better characterization of the ovarian structures can play an important role in the early detection of pathologies (e.g., ovarian cysts, polycystic ovarian syndrome or even ovarian cancer), and can also help monitor follicle growth and distribution, important features for assisted reproductive treatments.

Brightness mode (B-mode) ultrasound imaging is commonly used in gynecological clinical practice because it allows visualization of the ovary and its structures. In B-mode images, follicles appear as hypo-echogenic elliptical structures, while the ovarian stroma exhibits a slight variation in texture relative to the surrounding tissue and has partially hyper-echogenic boundaries. Figure 1 shows an example of a gynecological B-mode image containing an ovary with three follicles, together with their manual segmentation.

Image segmentation methods are often used to automatically extract objects from images, reducing both analysis time and diagnostic errors. However, ultrasound image segmentation is challenging due to the presence of several image artifacts and noise [6]. According to the latest review on follicle detection [7], existing methods for segmenting ovarian structures can only detect and measure large follicles. To the best of our knowledge, segmentation of the stroma has not received much attention, having been used only to reduce the search space for follicle detection [1].

Neural network techniques have been achieving impressive results in visual recognition systems. Among them, fully convolutional neural networks (fCNNs) are especially good at learning image features from training data and have proved to be a powerful tool for the segmentation of biomedical images [8]. The research presented herein explores the use of fCNNs for segmenting the ovarian structures, namely stroma and follicles, in a single process.

Fig. 1. B-mode image of an ovary and three follicles, and their segmentation.

2 Methodology

This section presents the methods implemented in this work to segment the ovarian structures in B-mode images. The following subsections detail the proposed system, its fCNN architecture and the loss functions used.

2.1 Architecture

An overview of the proposed system is shown in Fig. 2. Switches \(S_{1}\) and \(S_{2}\) can be triggered to change the input data of the network and the tasks to be trained, respectively. These changes can work as regularization of the fCNN.

Fig. 2. Overview of the proposed system with multiple tasks.

When switch \(S_{1}\) is turned on, the B-mode image is preprocessed with contrast limited adaptive histogram equalization (CLAHE) [11] to enhance local contrast and improve the visibility of the ovarian structures. Both the CLAHE-enhanced and the original images can be used as input data, as represented on the left side of Fig. 2.
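A minimal preprocessing sketch using OpenCV's CLAHE implementation; the clip limit, tile size and file path are illustrative assumptions, as the paper does not report them:

```python
import cv2

# Hypothetical parameters: the paper does not state the CLAHE settings used.
gray_image = cv2.imread("ovary_bmode.png", cv2.IMREAD_GRAYSCALE)  # 8-bit B-mode frame
clahe = cv2.createCLAHE(clipLimit=2.0, tileGridSize=(8, 8))
enhanced = clahe.apply(gray_image)  # locally contrast-equalized image fed to the fCNN
```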

For training, switch \(S_{2}\) can be used to activate multi-task learning, which consists of using the same network to solve multiple tasks simultaneously. In this work, a mask of the ovary is used as the ground truth of an auxiliary task in order to prevent the network from classifying elements outside the ovary as follicles, or pixels inside the ovary as background. The auxiliary task acts as a regularizer during training [9] and can help the network focus its attention on difficult cases [3].

The fCNN architecture used in this work (Fig. 3) is based on U-net [8]. It consists of a downsampling stream (left side) followed by a symmetric upsampling stream (right side), with data from the downsampling stream skip-connected to the corresponding layers in the upsampling stream. Each convolutional layer is followed by a batch normalization layer and a ReLU activation layer; in addition, a dropout layer is inserted wherever pooling or concatenation operations are performed. The last layer is a \(1\times 1\) convolution followed by a softmax, which produces a pixel-wise discrete probability distribution over the three classes of interest (follicle, ovarian stroma or background).

Fig. 3. Architecture of the implemented fCNN.
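A sketch of the described architecture in the current tf.keras API (the original used Keras 1.2.2); the depth, filter widths and dropout rate are assumptions, since Fig. 3 fixes them but they are not reproduced here, and the auxiliary ovary head of Fig. 2 would branch off analogously:

```python
from tensorflow.keras import layers, Model, Input

def conv_block(x, filters):
    # two (Conv -> BatchNorm -> ReLU) stages, as described above
    for _ in range(2):
        x = layers.Conv2D(filters, 3, padding='same')(x)
        x = layers.BatchNormalization()(x)
        x = layers.Activation('relu')(x)
    return x

inputs = Input((512, 512, 1))
# downsampling stream, storing skip connections (widths are assumptions)
skips, x = [], inputs
for f in (32, 64, 128):
    x = conv_block(x, f)
    skips.append(x)
    x = layers.MaxPooling2D(2)(layers.Dropout(0.5)(x))  # dropout at pooling
x = conv_block(x, 256)
# symmetric upsampling stream
for f, skip in zip((128, 64, 32), reversed(skips)):
    x = layers.UpSampling2D(2)(x)
    x = layers.Dropout(0.5)(layers.Concatenate()([x, skip]))  # dropout at concatenation
    x = conv_block(x, f)
# 1x1 convolution + softmax over (background, follicle, stroma)
outputs = layers.Conv2D(3, 1, activation='softmax')(x)
model = Model(inputs, outputs)
```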

2.2 Loss Function

The proposed loss function can be decomposed into a main-task and an auxiliary-task component. In addition, weight maps can be applied as regularization, in order to penalize wrong classifications. The details of each step are explained below.

The average Dice Similarity Coefficient (DSC) of each class, as proposed in [5], is the main component of the loss function. The average DSC can be defined as \(\overline{ DSC }(Y,\hat{Y}) = 0.5 [ DSC (Y_f,\hat{Y}_f) + DSC (Y_s,\hat{Y}_s)]\), where \(Y\) represents the predictions and \(\hat{Y}\) the ground truth (GT); the subscripts \(f\) and \(s\) denote the follicles and the stroma, respectively. The background was not considered in the loss function because it is the largest region in the image and would otherwise heavily influence the results.
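A soft (differentiable) version of \(\overline{ DSC }\) written against the current tf.keras backend; the channel ordering (background, follicle, stroma) and the smoothing constant are assumptions:

```python
import tensorflow as tf

def soft_dice(y_true, y_pred, eps=1e-6):
    # soft Dice coefficient for a single class channel
    inter = tf.reduce_sum(y_true * y_pred)
    return (2.0 * inter + eps) / (tf.reduce_sum(y_true) + tf.reduce_sum(y_pred) + eps)

def mean_dsc(y_true, y_pred):
    # average of the follicle (channel 1) and stroma (channel 2) Dice;
    # the background channel is deliberately excluded, as in the text
    return 0.5 * (soft_dice(y_true[..., 1], y_pred[..., 1]) +
                  soft_dice(y_true[..., 2], y_pred[..., 2]))
```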

In addition, two weight maps were computed to be applied together with the DSC. The first (\(W_f\)) penalizes wrong classifications between nearby follicles, as in the weight map \(w(x)\) defined for U-net [8]: the value of \(W_f(i)\) is calculated from the distances between the ith pixel and the borders of the two nearest follicles. The second (\(W_o\)) penalizes false detections of ovarian structures in the background region, and is defined for each pixel as:

$$\begin{aligned} W_{o}(i) = {\left\{ \begin{array}{ll} 0 & \text {if } i \text { is inside the ovary} \\ 1 & \text {if } \varDelta _{o}(i) > \ln (10)\, \sigma ^2 \\ 0.1 \cdot \exp \left( \frac{\varDelta _{o}(i)}{\sigma ^2}\right) & \text {otherwise,} \end{array}\right. } \end{aligned}$$
(1)

where \(\varDelta _{o}(i)\) is the distance from pixel i to the nearest pixel of the ovary, and \(\sigma \) is a constant that controls the distribution of the weights around the ovary.
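Eq. (1) can be realized with a Euclidean distance transform; a minimal sketch, assuming a binary ovary mask with value 1 inside the ovary:

```python
import numpy as np
from scipy.ndimage import distance_transform_edt

def ovary_weight_map(ovary_mask, sigma):
    # distance from every pixel to the nearest ovary pixel (0 inside the ovary)
    delta = distance_transform_edt(ovary_mask == 0)
    w = 0.1 * np.exp(delta / sigma ** 2)
    w[delta > np.log(10) * sigma ** 2] = 1.0  # saturate far from the ovary
    w[ovary_mask > 0] = 0.0                   # no penalty inside the ovary
    return w
```

Note that the branches of Eq. (1) meet continuously: at \(\varDelta _{o}(i) = \ln (10)\, \sigma ^2\), the exponential term equals exactly 1.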

Then, the loss function of the main task is computed as:

$$\begin{aligned} \mathcal {L}_{1}(Y,\hat{Y}) = 1 - \lambda _1\overline{ DSC }(Y,\hat{Y}) + \lambda _2 \frac{\sum _i Y_f(i) W_f(i)}{\sum _i Y_f(i)} + \lambda _3 \frac{\sum _i Y_s(i) W_o(i)}{\sum _i Y_s(i)}, \end{aligned}$$
(2)

where \(\lambda _{1,2,3} \in \mathbb {R}^+\) are constants used to adjust the influence of each weight map.
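A sketch of Eq. (2) as a tf.keras loss function, reusing mean_dsc from above. Feeding the per-image weight maps into the loss by closure is one possible wiring, not necessarily the authors'; the \(\lambda \) defaults are placeholders (the paper only states \(\lambda _2 = \lambda _3 \in [0,1]\)):

```python
def main_task_loss(w_f, w_o, lam1=1.0, lam2=0.5, lam3=0.5):
    # w_f, w_o: weight-map tensors aligned with the batch
    def loss(y_true, y_pred):
        y_f, y_s = y_pred[..., 1], y_pred[..., 2]
        pen_f = tf.reduce_sum(y_f * w_f) / (tf.reduce_sum(y_f) + 1e-6)
        pen_o = tf.reduce_sum(y_s * w_o) / (tf.reduce_sum(y_s) + 1e-6)
        return 1.0 - lam1 * mean_dsc(y_true, y_pred) + lam2 * pen_f + lam3 * pen_o
    return loss
```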

The loss function of the auxiliary task is given by:

$$\begin{aligned} \mathcal {L}_{2}(Y_o,\hat{Y}_o) = 1 - DSC (Y_o,\hat{Y}_o), \end{aligned}$$
(3)

where \(\hat{Y}_o\) is the GT mask of the ovary and \(Y_o\) is the predicted ovary.

Finally, the total loss function is defined as:

$$\begin{aligned} \mathcal {L}(Y,\hat{Y}) = \alpha _1 \mathcal {L}_{1}(Y,\hat{Y}) + \alpha _2 \mathcal {L}_{2}(Y_o,\hat{Y}_o), \end{aligned}$$
(4)

where \(\alpha _{1,2} \in \mathbb {R}^+\), with \(\alpha _1 + \alpha _2 = 1\), are constants used to adjust the influence of each component.
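Eq. (3) and the weighting of Eq. (4) map naturally onto Keras loss dictionaries for a hypothetical two-headed model (a "main" softmax output and an "aux" ovary-mask output; the names are ours). A sketch reusing soft_dice and main_task_loss from above:

```python
def aux_task_loss(y_true, y_pred):
    # Eq. (3): one minus the soft Dice of the predicted ovary mask
    return 1.0 - soft_dice(y_true, y_pred)

# Eq. (4): alpha_1 = 0.75 and alpha_2 = 0.25 when multi-task learning is on;
# with S2 "off", alpha_1 = 1 and alpha_2 = 0 (see Sect. 2.3).
losses = {'main': main_task_loss(w_f, w_o), 'aux': aux_task_loss}
weights = {'main': 0.75, 'aux': 0.25}
```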

2.3 Implementation Steps

The proposed system was evaluated using six variations. The input data of the fCNN (Fig. 2) is changed by switch \(S_{1}\), and switch \(S_{2}\) controls the multi-task learning. When \(S_{2}\) is “off”, Eq. (4) is evaluated with \(\alpha_1 = 1\) and \(\alpha_2 = 0\); otherwise, \(\alpha_1 = 0.75\) and \(\alpha_2 = 0.25\). Finally, the values of \(\lambda_2\) and \(\lambda_3\) in Eq. (2) determine whether the weight maps are added to the loss function. For all experiments, \(\lambda_2 = \lambda_3 \in [0,1]\) and \(\lambda_1 = 1\). The values of \(\alpha\), \(\lambda\) and \(\sigma\) were defined empirically and kept fixed during each training run.

All original B-mode images were converted to gray-scale and cropped to \(512 \times 512\) pixels. Aside from the batch normalization layers, no regularization or normalization to zero mean and unit variance was applied to the input images. To enlarge the training set, data augmentation with random linear transformations (rotation, translation, flipping and zooming) was applied at each training iteration. Each iteration was performed with a batch of 4 images.
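A possible augmentation setup with tf.keras's ImageDataGenerator; the transformation ranges, the variable names x_train/y_train (preprocessed images and GT masks) and the single-mask target are assumptions, with the auxiliary-task targets omitted for brevity:

```python
from tensorflow.keras.preprocessing.image import ImageDataGenerator

# The paper names only the transform types; these ranges are illustrative.
aug = dict(rotation_range=15, width_shift_range=0.1, height_shift_range=0.1,
           zoom_range=0.1, horizontal_flip=True, fill_mode='constant', cval=0.0)
img_gen, msk_gen = ImageDataGenerator(**aug), ImageDataGenerator(**aug)

# A shared seed keeps every augmented image aligned with its augmented mask.
seed = 1
train_batches = zip(img_gen.flow(x_train, batch_size=4, seed=seed),
                    msk_gen.flow(y_train, batch_size=4, seed=seed))
```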

The network was trained using the Adam (Adaptive Moment Estimation) optimizer [4] with an initial learning rate of \(10^{-2}\). In this state-of-the-art stochastic optimization method, each weight of the network has its own learning rate, and the learning rates are adapted during training. To reduce the probability of overfitting, an early-stopping callback halts the training if the validation loss improves by less than \(10^{-3}\) over 50 epochs. This work was implemented in Python 2.7 using the Keras 1.2.2 framework with TensorFlow 1.0.0 as the backend.
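A training-setup sketch in the current tf.keras API rather than Keras 1.2.2, reusing the losses, weights and train_batches defined above; x_val/y_val and the maximum epoch count are assumptions (the paper relies on early stopping):

```python
from tensorflow.keras.callbacks import EarlyStopping
from tensorflow.keras.optimizers import Adam

model.compile(optimizer=Adam(learning_rate=1e-2), loss=losses, loss_weights=weights)

# Stop if the validation loss improves by less than 1e-3 over 50 epochs.
stop = EarlyStopping(monitor='val_loss', min_delta=1e-3, patience=50)
model.fit(train_batches, steps_per_epoch=len(x_train) // 4, epochs=1000,
          validation_data=(x_val, y_val), callbacks=[stop])
```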

3 Results

This section presents the dataset, the evaluation methodologies and the obtained results.

3.1 Dataset

The dataset consists of 87 B-mode images. Each image contains one ovary of a woman of childbearing age, with at least one follicle. The images were acquired with an Ultrasonix SonixTouch Q+ scanner. During acquisition, the medical doctor performed semi-automatic segmentations using the Ultrasonix Auto Follicle segmentation (AF) tool [2]. It must be noted that not all follicles were segmented by the doctor, leading to, for instance, an ovary with 4 follicles but only one semi-automatic segmentation. Subsequently, a medical expert manually segmented each ovary and each follicle to produce the GT. The images were randomly divided into 57 for training, 15 for validation and 15 for testing.

3.2 Evaluation

The quantitative evaluation of the results was divided into two validation methodologies. First, the DSC between the GT and the segmentations predicted by the different trained networks is presented. Second, a single follicle evaluation (SFE) was performed and compared with the AF segmentation mentioned in Sect. 3.1.

The motivation for the SFE lies in the fact that a GT mask may contain more segmented follicles than were annotated by the doctor with the AF tool during acquisition. For example, while the GT of the test set has 44 manually segmented follicles, only 25 follicles were annotated with the AF tool. The SFE verifies whether a follicle segmented by the AF has a corresponding follicle in the GT and in the fCNN segmentations. Then, for each follicle present in the AF data, the DSC of GT vs. AF and of GT vs. fCNN is computed.
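A sketch of the per-follicle matching behind the SFE; matching a reference follicle to the connected component with the largest overlap is our assumption, as the paper does not specify the correspondence criterion:

```python
import numpy as np
from scipy.ndimage import label

def single_follicle_dsc(ref_follicle, seg_mask):
    # ref_follicle: binary mask of one follicle; seg_mask: binary follicle
    # segmentation (GT, AF or fCNN). Returns the DSC of the matched pair.
    labeled, n = label(seg_mask)
    best, best_overlap = None, 0
    for k in range(1, n + 1):
        comp = (labeled == k)
        overlap = np.logical_and(comp, ref_follicle).sum()
        if overlap > best_overlap:
            best, best_overlap = comp, overlap
    if best is None:
        return 0.0  # no corresponding follicle found
    inter = np.logical_and(best, ref_follicle).sum()
    return 2.0 * inter / (best.sum() + ref_follicle.sum())
```

Under this assumption, GT vs. AF is obtained by matching each AF follicle against the GT mask, and GT vs. fCNN by matching the corresponding GT follicle against the fCNN output.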

Figure 4 illustrates two SFE scenarios. Figure 4(a) shows a successful case with a large overlap, while Fig. 4(b) shows an incorrect segmentation: the fCNN and the AF segmented a single large follicle that merged the two existing follicles into one, leading to an inaccurate detection.

Fig. 4. Illustration of SFE for GT (green) vs AF/fCNN (red); yellow represents the agreement with GT. The correspondences between the GT and the predictions are represented by the arrows. (a) Successful case where a single follicle in GT corresponds to a single follicle segmented by AF and fCNN; (b) an incorrect segmentation, since two follicles were merged. (Color figure online)

3.3 Results

The overall DSC results for the six trained networks are shown in Table 1. fCNNs #1 and #4 show the best overall DSC for the follicles and the stroma. Figure 5 shows four examples of the segmentations produced by the developed fCNNs. The highest DSC achieved for the follicles was 0.955, with fCNN #1 (Fig. 5(a)), and for the stroma 0.855, with fCNN #3 (Fig. 5(b)). A standard case and the image with the worst segmentations are presented in Fig. 5(c) and (d), respectively.

Table 1. Overview of the DSC for the segmentations predicted by the trained fCNNs.
Fig. 5. Examples of segmentation results: (a) the best follicle DSC, (b) the best stroma DSC, (c) a standard case, (d) the worst image.

In a qualitative analysis, the application of multi-task learning prevented follicles from being classified outside the ovary. This approach achieved fast convergence during training and the smallest validation errors. However, for three test images with low contrast (e.g. Fig. 5(d)), the ovarian structures were poorly detected or not detected at all, which impaired the overall results. The application of CLAHE improved the results, and the use of the weight map \(W_o\) solved the problem of false-positive ovaries. However, the weight map \(W_f\) did not significantly reduce the misclassification of pixels between closely spaced follicles; in addition, it caused the outer boundary of the follicles to be misclassified as background.

The results of the SFE are presented in Table 2. All architectures except fCNN #2 outperformed the AF. The best overall SFE results were obtained with the simplest architecture. Although CLAHE improved the contrast in boundary regions, the SFE did not improve when CLAHE was used.

Table 2. Overview of DSC for the single follicle evaluation (SFE).

4 Conclusions

In this paper, the first supervised fCNN for the end-to-end segmentation of the stroma and follicles of ovaries in B-mode images was presented. Despite being trained with a small dataset, the developed method does not depend on heavy preprocessing or post-processing strategies. The visual results show that an fCNN can learn features that distinguish the ovarian structures in B-mode images. This capability could allow a better characterization of the often-overlooked stroma region. The proposed method also proved to be more accurate than a commercial semi-automatic method for follicle segmentation.

Despite presenting slightly better results on the validation set, the proposed regularization techniques show worse overall DSC results on the test set when compared with the simplest fCNNs (#1 and #4). This may have been caused by the increased complexity of the segmentation task and by the regularization terms overwhelming the information in the data. An improvement of the proposed regularizations should be investigated to yield better results.

In future steps of this investigation, the proposed fCNN will be extended to a deeper architecture, increasing the number of learnable features. Due to the scarcity of data, k-fold cross-validation should be applied to better evaluate the consistency of each architecture. A more efficient loss function will also be elaborated to force the network to learn the boundaries of the follicles. Finally, enlarging the dataset is fundamental to improve the variability of the training set.