1 Introduction

Ultrasound imaging is a dominant modality for maternal and fetal health monitoring during pregnancy. However, traditional 2D planar ultrasound provides only an implicit, slice-by-slice view of anatomies and thus introduces inevitable user dependency and diagnostic error. With its broad volumetric field of view, 3D prenatal ultrasound is rapidly emerging as a viable alternative. Volumetric biometrics have been proposed and have attracted great interest for more accurate fetal growth evaluation [11]. Versatile as it is, the widespread adoption of 3D prenatal ultrasound is still limited by the lack of efficient tools to decompose the volumes. Semi-automatic segmentation systems, like VOCAL [14], have been applied in the clinic. However, these systems often involve cumbersome interactions and lead to diagnostic discrepancies. Under this situation, automated volumetric segmentation techniques are highly demanded to accurately interpret prenatal ultrasound volumes.

Fig. 1. From left to right: sagittal, transverse and coronal planes, and a cutaway view of the volumetric segmentation of a prenatal ultrasound volume. Fetus, gestational sac and placenta in the planes and in the segmentation are denoted in green, ocean blue and red, respectively.

As depicted in Fig. 1, simultaneously segmenting multiple objects, including fetus, gestational sac and placenta, in prenatal ultrasound volumes remains a very arduous task. Firstly, speckle noise, acoustic shadow and low contrast between tissues conspire to produce ubiquitous boundary ambiguity and deficiency. Secondly, the spatial consistency of objects in an ultrasound volume degrades along directions perpendicular to the acoustic beam. Thirdly, fetus, gestational sac and placenta exhibit large appearance variations, highly irregular shapes and varying spatial relationships.

Utilizing tissue intensity distribution, Anquez et al. [1] made an early attempt to segment the utero-fetal unit. Stevenson et al. [15] proposed a semi-automatic method to extract placenta volume. The intensity priors exploited in these methods limit their robustness to appearance diversity across subjects. Lee et al. [7] built boundary traces to extract limb volume for fetal weight estimation. Recently, Andrea et al. [3] explored a statistical shape model to analyze fetal facial morphology. However, confined by limited training data, shape models cannot tackle highly varying objects, like the fetus and placenta in Fig. 1. Deep learning [9] is rapidly taking the dominant role over traditional methods [12] for ultrasound image segmentation. However, the limited receptive field of deep networks, such as Convolutional Neural Networks, degrades their ability to handle boundary incompleteness of arbitrary size [2].

In this paper, we address the problem of volumetric segmentation in prenatal ultrasound. Our contribution is threefold. First, we propose a general framework for simultaneous segmentation of multiple complex objects in ultrasound volumes, including fetus, gestational sac and placenta, which remains a rarely studied yet highly challenging task. To the best of our knowledge, this is the first fully automatic solution in the field. Second, based on our customized 3D Fully Convolutional Network, we propose to inject a Recurrent Neural Network (RNN) to flexibly explore 3D semantic knowledge from a novel, sequential perspective and thereby significantly refine the local segmentation result. Coupled with an effective serialization strategy, our RNN successfully tackles the ubiquitous boundary uncertainty in ultrasound volumes. Third, to combat the vanishing gradient problem and exploit the latent hierarchy in the sequence, we introduce a hierarchical deep supervision mechanism (HiDS) to effectively boost the information flow within the RNN and further improve the semantic segmentation. Validated on a large dataset, our approach achieves superior performance and proves promising for decomposing prenatal ultrasound volumes.

Fig. 2. Schematic view of our proposed framework. For the probability volumes, from top to bottom: background, fetus, gestational sac and placenta.

2 Methodology

Figure 2 is the schematic view of our proposed framework. The system input is an ultrasound volume. Our customized 3D FCN first conducts dense voxel-wise semantic labeling and generates intermediate probability volumes for the different classes. The RNN, trained with hierarchical deep supervision, then explores contextual information within multiple volume channels to refine the semantic labeling. The system outputs the extracted volumes of fetus, gestational sac and placenta.

2.1 Initial Dense Semantic Labeling with 3D FCN

Fully Convolutional Networks (FCNs) [10] are popular in semantic segmentation for their capability of end-to-end mapping. U-net [13] improves on the FCN by adding skip connections to merge feature maps from different semantic levels. Skip connections are critical for the network to recognize possible boundary details in ultrasound images. Since volumetric data inherently provide more complete stereo information than 2D planar images, it is also desirable for the network to digest 3D data directly [4]. Therefore, as shown in Fig. 2, by equipping all layers with 3D operators, we customize a 3D FCN with long skip connections to efficiently conduct dense semantic labeling on prenatal ultrasound volumes. Specifically, we use an element-wise sum operator to merge feature volumes from different resolutions, which smooths the gradient flow. To limit computation cost, we adopt small convolution kernels of size 3 \(\times \) 3 \(\times \) 3 in the convolutional layers (Conv). Each Conv layer is followed by a batch normalization (BN) layer and a rectified linear unit (ReLU). The 3D FCN outputs probability volumes for the different classes.
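For concreteness, the following minimal PyTorch sketch illustrates the ingredients described above: 3 \(\times \) 3 \(\times \) 3 Conv-BN-ReLU blocks, downsampling/upsampling, and long skip connections merged by element-wise summation. The paper's implementation used Caffe; the depth, channel widths and layer arrangement here are illustrative assumptions, not the exact architecture.

```python
import torch
import torch.nn as nn

def conv_block(in_ch, out_ch):
    # 3x3x3 convolution followed by batch normalization and ReLU
    return nn.Sequential(
        nn.Conv3d(in_ch, out_ch, kernel_size=3, padding=1),
        nn.BatchNorm3d(out_ch),
        nn.ReLU(inplace=True),
    )

class FCN3D(nn.Module):
    """Toy 3D FCN with a long skip connection merged by element-wise sum."""
    def __init__(self, n_classes=4, base=16):
        super().__init__()
        self.enc1 = conv_block(1, base)
        self.down = nn.MaxPool3d(2)
        self.enc2 = conv_block(base, base * 2)
        self.bottom = conv_block(base * 2, base * 2)
        self.up = nn.Upsample(scale_factor=2, mode='trilinear', align_corners=False)
        self.reduce = nn.Conv3d(base * 2, base, kernel_size=1)  # match channels for the sum
        self.dec1 = conv_block(base, base)
        self.head = nn.Conv3d(base, n_classes, kernel_size=1)   # voxel-wise class scores

    def forward(self, x):                 # x: (N, 1, D, H, W)
        e1 = self.enc1(x)                 # full-resolution features
        e2 = self.bottom(self.enc2(self.down(e1)))
        d1 = self.reduce(self.up(e2))     # back to full resolution
        d1 = self.dec1(d1 + e1)           # long skip connection: element-wise sum
        return self.head(d1)              # logits; softmax yields probability volumes

probs = torch.softmax(FCN3D()(torch.randn(1, 1, 64, 64, 64)), dim=1)
```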

2.2 Semantic Labeling Refinement with RNN

As we observe, local boundary deficiency in ultrasound volumes tends to corrupt the semantic predictions of the 3D FCN. Leveraging contextual information is effective in addressing boundary incompleteness. Motivated by [2, 17], and different from the traditional, fixed structures used to collect context cues [16], we propose to explore Recurrent Neural Networks (RNNs) to flexibly encode contextual knowledge and refine the semantic labeling from a novel, sequential perspective. With internal memory cells, an RNN infers the output at the current timestep by considering the current input and the historical information accumulated in its hidden state. In our case, the RNN runs sequentially over a local region, so its dynamic hidden states can be interpreted as local contextual knowledge and used to recover corrupted boundaries. Our RNN is trained after the 3D FCN. As shown in Fig. 2, by taking as input the concatenation of the probability volumes and the raw ultrasound volume, the RNN can distill rich context information for prediction enhancement. Specifically, we exploit the Bidirectional Long Short-Term Memory (BiLSTM) [5] network, a popular RNN variant, to capture long-range spatial dependencies and enable interactions between sequential information flows from different directions. Mathematically, given an input sequence \(\varvec{x}=(x_{1},...,x_{T})\) and a target sequence \(\varvec{y}=(y_{1},...,y_{K})\), the BiLSTM models the output at each timestep by the following equations:

$$\begin{aligned} \overrightarrow{h}_{t}&= \overrightarrow{\mathcal {H}}(W_{x\overrightarrow{h}}x_{t} + W_{\overrightarrow{h}\overrightarrow{h}}\overrightarrow{h}_{t-1} + b_{\overrightarrow{h}}) \; \end{aligned}$$
(1)
$$\begin{aligned} \overleftarrow{h}_{t}&= \overleftarrow{\mathcal {H}}(W_{x\overleftarrow{h}}x_{t} + W_{\overleftarrow{h}\overleftarrow{h}}\overleftarrow{h}_{t-1} + b_{\overleftarrow{h}}) \; \end{aligned}$$
(2)
$$\begin{aligned} \hat{y}_{t}&= W_{\overrightarrow{h}y}\overrightarrow{h}_{t} + W_{\overleftarrow{h}y}\overleftarrow{h}_{t} + b_{y}, \end{aligned}$$
(3)

where the W terms denote weight matrices, the h terms denote internal hidden states controlled by tunable gates, and the b terms denote bias vectors. \(\overrightarrow{\mathcal {H}}\) and \(\overleftarrow{\mathcal {H}}\) are hidden layer functions. By serializing volumes into sequences and training with a cross-entropy loss, our BiLSTM conducts direct sequence-to-sequence mapping and outputs the refined voxel labeling results through a softmax layer.
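Eqs. 1-3 correspond to a bidirectional LSTM whose forward and backward hidden states are linearly combined at every timestep. A minimal PyTorch sketch follows (the original BiLSTM was implemented in Theano; the per-voxel output head and the feature dimensions are illustrative assumptions):

```python
import torch
import torch.nn as nn

class BiLSTMRefiner(nn.Module):
    """Sequence-to-sequence voxel relabeling with a bidirectional LSTM (Eqs. 1-3)."""
    def __init__(self, in_dim, hidden=800, n_classes=4, cube=7):
        super().__init__()
        self.cube_voxels = cube ** 3
        self.n_classes = n_classes
        # one forward and one backward branch, as in Eqs. (1) and (2)
        self.bilstm = nn.LSTM(in_dim, hidden, bidirectional=True, batch_first=True)
        # Eq. (3): linear combination of both hidden states, here mapped to
        # per-voxel class scores for each cube (an assumed output granularity)
        self.out = nn.Linear(2 * hidden, self.cube_voxels * n_classes)

    def forward(self, x):                          # x: (batch, T, in_dim)
        h, _ = self.bilstm(x)                      # h: (batch, T, 2 * hidden)
        logits = self.out(h)                       # (batch, T, cube_voxels * n_classes)
        return logits.view(x.size(0), x.size(1), self.cube_voxels, self.n_classes)

# each timestep holds one flattened 7x7x7 cube from all 5 input channels
refiner = BiLSTMRefiner(in_dim=5 * 7 ** 3)
logits = refiner(torch.randn(1, 1000, 5 * 7 ** 3))  # a 1000-step sequence
```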

Different serialization schemes mine the sequentiality of volumetric data differently. We find that, with a proper size for the sequence primitives, serializing a volume into a sequence of overlapping cubes provides better capability than the slice-based serialization in [2]. In this manner, a 50 \(\times \) 50 \(\times \) 50 volume is evenly divided into more than 1000 overlapping 7 \(\times \) 7 \(\times \) 7 cubes, which are then sequentially concatenated to form a sequence; deserialization is the inverse operation. The BiLSTM captures context cues over the long sequence and significantly refines the labeling result, and, as detailed in Sect. 2.3, we obtain further improvement by coupling our RNN with a tailored training mechanism.
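The cube-based serialization and its inverse can be sketched as follows. The stride between neighbouring cubes is not specified in the paper; the stride of 4 used here is an assumption chosen so that a 50 \(\times \) 50 \(\times \) 50 volume yields more than 1000 overlapping 7 \(\times \) 7 \(\times \) 7 cubes.

```python
import numpy as np

def cube_starts(size, cube, stride):
    """Start indices along one axis, always covering the last voxels."""
    starts = list(range(0, size - cube + 1, stride))
    if starts[-1] != size - cube:
        starts.append(size - cube)
    return starts

def serialize(volume, cube=7, stride=4):
    """Volume -> sequence of flattened, overlapping cubes (plus their origins)."""
    D, H, W = volume.shape
    seq, origins = [], []
    for z in cube_starts(D, cube, stride):
        for y in cube_starts(H, cube, stride):
            for x in cube_starts(W, cube, stride):
                seq.append(volume[z:z + cube, y:y + cube, x:x + cube].ravel())
                origins.append((z, y, x))
    return np.stack(seq), origins              # (T, cube**3) sequence

def deserialize(seq, origins, shape, cube=7):
    """Inverse mapping: paste cubes back and average the overlapping voxels."""
    acc = np.zeros(shape, dtype=np.float32)
    cnt = np.zeros(shape, dtype=np.float32)
    for vec, (z, y, x) in zip(seq, origins):
        acc[z:z + cube, y:y + cube, x:x + cube] += vec.reshape(cube, cube, cube)
        cnt[z:z + cube, y:y + cube, x:x + cube] += 1.0
    return acc / cnt

seq, origins = serialize(np.random.rand(50, 50, 50))
print(len(seq))   # 12**3 = 1728 cubes, i.e. a sequence of more than 1000 timesteps
```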

2.3 Network-Specific Deep Supervision Mechanism

Subject to the vanishing gradient issue, the parameter tuning of our 3D FCN and RNN is at high risk of inefficiency and overfitting. In this paper, we propose a network-specific deep supervision strategy to facilitate the system training. For the 3D FCN part, we adopt the deep supervision strategy introduced in [4, 6], which promotes training by exposing shallow convolutional layers to the direct supervision of \(\mathcal {M}\) auxiliary classifiers. The final loss function of our deeply supervised 3D FCN is formulated as Eq. 4, where \(\mathcal {X}\), \(\mathcal {Y}\) are training pairs and \(\mathcal {W}\) are the weights of the main network. \(w = (w^1,w^2,..,w^m)\) are the weights of the auxiliary classifiers and \(\alpha _m\) is the ratio of the m-th auxiliary loss in the final loss; we use \(m=2\) auxiliary classifiers in this paper. Cross entropy is used for both the main loss \(\mathcal {L}\) and the auxiliary losses \(\mathcal {L}_m\).

$$\begin{aligned} \mathcal {L}(\mathcal {X},\mathcal {Y};\mathcal {W},w) = \mathcal {L}(\mathcal {X},\mathcal {Y};\mathcal {W}) + \sum _{m \in \mathcal {M}}{\alpha _m\mathcal {L}_m(\mathcal {X}, \mathcal {Y};\mathcal {W},w^m)} + \lambda (||\mathcal {W}||^2) \end{aligned}$$
(4)
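A minimal sketch of Eq. 4 is given below, assuming the main and auxiliary classifiers each output voxel-wise logits at the target resolution; the auxiliary weights \(\alpha _m\) shown are illustrative, and the weight decay term \(\lambda ||\mathcal {W}||^2\) is left to the optimizer.

```python
import torch
import torch.nn.functional as F

def deeply_supervised_loss(main_logits, aux_logits_list, target, alphas=(0.3, 0.6)):
    """Eq. 4: main cross-entropy plus weighted auxiliary cross-entropy terms.

    main_logits     : (N, C, D, H, W) output of the final classifier
    aux_logits_list : list of (N, C, D, H, W) outputs of the auxiliary classifiers,
                      already upsampled to the label resolution (an assumption)
    target          : (N, D, H, W) voxel-wise class labels
    """
    loss = F.cross_entropy(main_logits, target)
    for alpha, aux in zip(alphas, aux_logits_list):
        loss = loss + alpha * F.cross_entropy(aux, target)
    return loss
```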

Hierarchical Deep Supervision for RNN. Although the BiLSTM has gating functions to guide the gradient flow, it is nontrivial to effectively tune its gates and parameters for early timesteps. The BiLSTM may be over-tuned to fit the latter part of sequences in order to converge, especially when tackling sequences of extreme length (\({\ge }1000\)), which is exactly our case. The traditional training strategy for RNNs attaches a loss function at the end of the chain, and few studies have reported deep supervision mechanisms for RNNs. The target label replication strategy proposed in [8] is intractable for our sequence-to-sequence mapping task. A proper deep supervision strategy for RNNs should consider two facts: (i) auxiliary supervision should be injected at early timesteps to shorten the gradient backpropagation path; (ii) the locations that trigger auxiliary supervision should respect the latent, hierarchical context dependencies in the sequence. Rooted in these considerations, we propose a novel hierarchical deep supervision mechanism, denoted as HiDS, to boost the training efficiency and generalization of the RNN, as shown in Fig. 3(a). Sharing the same anchor point as the main loss function over the whole sequence, HiDS attaches auxiliary loss functions along the sequence with gradually increasing scopes. Equation 5 gives the final loss function with HiDS, where \(\mathrm {X}\), \(\mathrm {Y}\) are the input and output sequences of length T, with \(T=\mathcal {N}p\). W is the weight matrix of the RNN shared by all timesteps. \(\mathscr {L}_{N}\) is the main loss function covering the complete sequence, \(\mathscr {L}_n\) are the auxiliary loss functions, and \(\beta _n\) are their ratios in the final loss \(\mathscr {L}\); \(\beta _n=1\) in this paper.

$$\begin{aligned} \mathscr {L}(\mathrm {X},\mathrm {Y};W) = \mathscr {L}_{N}(\mathrm {X},\mathrm {Y};W)+ \sum ^{\mathcal {N}-1}_{n=1}{\beta _n\mathscr {L}_n(\mathrm {X}_{1 \le t< np},\mathrm {Y}_{1 \le t < np};W)} \end{aligned}$$
(5)
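Eq. 5 can be sketched as follows, assuming per-timestep cross-entropy, one label per timestep for simplicity, and evenly spaced scopes (here \(\mathcal {N}=4\), i.e. three auxiliary losses anchored at the start of the sequence):

```python
import torch
import torch.nn.functional as F

def hids_loss(logits, targets, n_scopes=4, betas=None):
    """Eq. 5: main loss over the full sequence plus auxiliary losses over
    prefixes of gradually increasing length, all sharing the anchor t = 1.

    logits  : (batch, T, C) per-timestep class scores from the BiLSTM
    targets : (batch, T)    per-timestep class labels
    """
    batch, T, C = logits.shape
    p = T // n_scopes                          # scope increment, T = N * p
    betas = betas or [1.0] * (n_scopes - 1)    # beta_n = 1 in the paper
    # main loss L_N over the complete sequence
    loss = F.cross_entropy(logits.reshape(-1, C), targets.reshape(-1))
    # auxiliary losses L_n over prefixes of length n * p, n = 1 .. N-1
    for n in range(1, n_scopes):
        pre_logits = logits[:, :n * p].reshape(-1, C)
        pre_targets = targets[:, :n * p].reshape(-1)
        loss = loss + betas[n - 1] * F.cross_entropy(pre_logits, pre_targets)
    return loss
```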

Figure 3(b) demonstrates the effect of cross-entropy based HiDS in boosting the training of a BiLSTM over sequences with 1000 timesteps. BiLSTMs equipped with 3 and 7 auxiliary loss functions converge faster and reach lower training errors than the BiLSTM trained with only the main loss function. The improvement in generalization ability brought by HiDS is elaborated in Sect. 3.

Fig. 3. Illustration of the hierarchical deep supervision mechanism for RNNs.

3 Experimental Results

Materials and Implementation Details: We built a dataset of 104 anonymized prenatal ultrasound volumes acquired from 104 pregnant volunteers with gestational ages of 10–14 weeks. Our dataset is the largest reported in the field. The average volume size is 221 \(\times \) 198 \(\times \) 283 with a voxel size of 0.5 \(\times \) 0.5 \(\times \) 0.5 mm. With local IRB approval, all volumes were acquired with Mindray DC-8 ultrasound systems with integrated 3D probes. Ten experienced radiologists provided annotations for all volumes under strict quality control. The dataset is randomly split into 50, 10 and 44 volumes for training, validation and testing. We further augmented the training set to 150 volumes by flipping and rotation. The 3D FCN is implemented in Caffe and the BiLSTM in Theano. Restricted by GPU memory, the 3D FCN takes 64 \(\times \) 64 \(\times \) 64 sub-volumes as input. 800 internal memory cells are allocated for each of the forward and backward branches of the BiLSTM. The input to the BiLSTM consists of 5 sub-volumes (50 \(\times \) 50 \(\times \) 50): one cropped from the ultrasound volume and four from the probability volumes. These sub-volumes are first serialized into sequences of overlapping 7 \(\times \) 7 \(\times \) 7 cubes, then flattened and concatenated, and finally fed into the BiLSTM step by step. We adopt sliding-window and overlap-tiling stitching strategies to generate predictions for whole ultrasound volumes.
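The sliding-window prediction with overlap-tiling stitching can be sketched as follows. The window size matches the 64 \(\times \) 64 \(\times \) 64 input of the 3D FCN; the 50% overlap (stride 32) and the averaging of overlapping predictions are assumptions, as the paper does not specify them.

```python
import numpy as np
import torch

def predict_volume(volume, model, n_classes=4, win=64, stride=32):
    """Slide a window over the whole volume and average overlapping predictions.
    Assumes every dimension of `volume` is at least `win` voxels."""
    D, H, W = volume.shape
    probs = np.zeros((n_classes, D, H, W), dtype=np.float32)
    count = np.zeros((D, H, W), dtype=np.float32)

    def starts(size):                      # window origins covering the full axis
        s = list(range(0, size - win + 1, stride))
        if s[-1] != size - win:
            s.append(size - win)
        return s

    model.eval()
    with torch.no_grad():
        for z in starts(D):
            for y in starts(H):
                for x in starts(W):
                    patch = np.ascontiguousarray(volume[z:z + win, y:y + win, x:x + win])
                    inp = torch.from_numpy(patch).float()[None, None]
                    p = torch.softmax(model(inp), dim=1)[0].numpy()
                    probs[:, z:z + win, y:y + win, x:x + win] += p
                    count[z:z + win, y:y + win, x:x + win] += 1.0
    return probs / count                    # stitched class probabilities
```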

Table 1. Quantitative evaluation of our proposed framework
Fig. 4. From left to right: cutaway view of the ultrasound volume, cutaway view of the complete segmentation showing the spatial relationship, and the extracted volumes of fetus, gestational sac and placenta.

Quantitative and Qualitative Analysis: To consider both region and boundary similarities, we adopt 4 metrics to evaluate the segmentation of the proposed framework: Dice coefficient (\(DSC=2(A\cap B)/(A+B)\)), Conformity (\(Conf=(3DSC-2)/DSC\)), Hausdorff Distance of Boundaries (Hdb[mm]) and Average Distance of Boundaries (Adb[mm]). An ablation study is conducted on our framework. The deeply supervised 3D FCN is taken as a competitive baseline and denoted as 3D-F. 3D-F with a basic BiLSTM for refinement is denoted as FB. FB equipped with 3 and 7 HiDS auxiliary loss functions is denoted as FB3Hi and FB7Hi, respectively. Table 1 presents the extensive comparison between these methods. Clear improvements in all three classes occur as our modules are applied successively. The placenta is the most challenging to segment, but with the context information contributed by our BiLSTM, we obtain improvements of more than 4% in DSC and 1.9 mm in Hdb. By enhancing the generalization ability, our deep supervision mechanism HiDS further boosts the segmentation performance in all metrics. Visualizations of the segmentation results produced by FB7Hi for fetus, gestational sac and placenta are shown in Fig. 4. Our method overcomes the poor image quality, complicated spatial configurations (even twins), boundary deficiency and spatial inconsistency in ultrasound volumes and presents smooth, promising segmentation for all three classes.
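The region and boundary metrics can be sketched as follows. The boundary distances here are computed from surface voxels extracted by morphological erosion, which is one common implementation and not necessarily the exact procedure used in the paper.

```python
import numpy as np
from scipy import ndimage

def dice_and_conformity(pred, gt):
    """DSC = 2|A n B| / (|A| + |B|);  Conf = (3*DSC - 2) / DSC."""
    inter = np.logical_and(pred, gt).sum()
    dsc = 2.0 * inter / (pred.sum() + gt.sum())
    return dsc, (3.0 * dsc - 2.0) / dsc

def boundary_distances(pred, gt, spacing=(0.5, 0.5, 0.5)):
    """Hausdorff (Hdb) and average (Adb) symmetric surface distances in mm."""
    def surface(mask):
        return np.logical_xor(mask, ndimage.binary_erosion(mask))
    sp, sg = surface(pred), surface(gt)
    # distance from every voxel to the nearest surface voxel of the other mask
    dt_g = ndimage.distance_transform_edt(~sg, sampling=spacing)
    dt_p = ndimage.distance_transform_edt(~sp, sampling=spacing)
    d_pg, d_gp = dt_g[sp], dt_p[sg]        # surface-to-surface distances
    hdb = max(d_pg.max(), d_gp.max())
    adb = (d_pg.sum() + d_gp.sum()) / (len(d_pg) + len(d_gp))
    return hdb, adb
```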

4 Conclusions

We present the first fully automatic framework for semantic segmentation of prenatal ultrasound volumes, which can potentially promote fetal health monitoring and open opportunities for crucial clinical studies that cannot be conducted with 2D planar ultrasound. We explore an RNN to flexibly encode local contextual knowledge and thus refine corrupted predictions from a novel, sequential perspective. By closely coupling the RNN with a hierarchical deep supervision mechanism, the latent hierarchy in the sequence is distilled to further boost segmentation performance. Promising quantitative and qualitative results are achieved on a large dataset. More clinical studies will be conducted in the near future.