1 Introduction

More and more systems for medical image segmentation rely on deep learning (DL). However, most publications on this topic report performance improvements for a particular segmentation task and imaging modality and use a specialized processing pipeline adapted through hyperparameter tuning. This makes it difficult to generalize the obtained results and bears the risk that the reported findings are artifacts. In line with the idea behind the 2018 Medical Segmentation Decathlon (MSD) [1], a challenge evaluating the generalisability of machine learning based segmentation algorithms, we argue that new segmentation systems should be evaluated across many different data cohorts and maybe even tasks. This reduces the risk of unintentional method overfitting and may help to gain more general insights about, for example, superior model architectures and learning methods for particular problem classes. This not only contributes to our basic understanding of segmentation algorithms, but also to the clinical acceptance and applicability of the systems – even if the generality could come at the cost of not reaching state-of-the-art performance on each individual cohort or task.

A DL segmentation framework that works across a wide range of tasks, and in which the individual components and hyperparameters are sufficiently understood, makes it possible to automate the task-specific adaptations. This is a prerequisite for being useful to practitioners who are not experts in DL. Big compute clusters offer one way to design systems that provide accurate segmentations for a variety of tasks and do not require tuning by DL experts. If compute resources are not limited, automatic model and hyperparameter selection can be implemented: given new training data, the system tests a large variety of segmentation algorithms and, for each algorithm, explores the space of the required hyperparameters. While this approach may produce powerful systems, and was employed to varying extents by top-performing MSD submissions, we argue that it has crucial drawbacks. First, it comes with a risk of automated method overfitting, even if the data is handled carefully. Second, the approach may be prohibitive in clinical practice (and for many scientific institutions) when there is simply no access to sufficient (data-regulation-compliant) compute resources.

This paper presents an open-source system for medical volume segmentation that addresses all the issues outlined above. It relies on a single neural network of fixed architecture that (1) shows very good performance across a variety of diverse segmentation tasks, (2) can be trained efficiently without DL expert knowledge, large amounts of data, or compute clusters, and (3) does not need large resources when deployed. The system architecture is a 2D U-Net [2, 3] variant. The decisive feature of our approach lies in extensive data augmentation, in particular by rotating the input volume before presenting slices to the fully convolutional network. Because of the latter, we refer to our approach as multi-planar U-Net training (MPUnet). We present a thorough evaluation of our system on a total of 13 different 3D segmentation tasks, including 10 from MSD, on which it obtains high accuracies – often matching the state-of-the-art performance of even highly specialized DL-based methods.

2 Method

At the heart of our system lies a 2D U-Net [2] modified slightly to (1) include batch normalization layers [4] intervening each double-convolution and up-convolution block and (2) use nearest-neighbor up-sampling followed by convolution to implement up-convolutions [5]. The basic network topology and hyperparameters can be set to their default choices, as done in all experiments in this paper; see Table S.1 in the supplementary material for an overview. Compared to [2], the number of filters has been increased by a factor of \(\sqrt{2}\), see supplementary Table S.6 for details. As a result, the model has \({\approx }62\) million parameters. While one might assume that the size of the model is a crucial hyperparameter, we kept the model architecture the same for all tasks. For each task, only the filters in the first layer were resized according to the number C of input channels, and the number of output units was set to the number of classes K.
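The architecture can be summarized in a few lines of Keras code. The following is only a minimal sketch under stated assumptions: the depth, exact filter counts (the paper scales the original U-Net filters by \(\sqrt{2}\), see Table S.6), and other details of the released implementation may differ, and all names are illustrative.

```python
# Minimal sketch of a 2D U-Net variant with batch normalization after each
# double-convolution block and nearest-neighbour up-sampling + convolution
# instead of transposed convolutions. Filter counts are placeholders.
import tensorflow as tf
from tensorflow.keras import layers, Model

def conv_block(x, filters):
    # Double convolution followed by batch normalization.
    x = layers.Conv2D(filters, 3, padding="same", activation="relu")(x)
    x = layers.Conv2D(filters, 3, padding="same", activation="relu")(x)
    return layers.BatchNormalization()(x)

def up_block(x, skip, filters):
    # Nearest-neighbour up-sampling followed by convolution, then merge with
    # the corresponding skip connection.
    x = layers.UpSampling2D(size=2, interpolation="nearest")(x)
    x = layers.Conv2D(filters, 2, padding="same", activation="relu")(x)
    x = layers.Concatenate()([skip, x])
    return conv_block(x, filters)

def build_2d_unet(h, w, n_channels, n_classes, base_filters=90):
    # base_filters=90 roughly corresponds to 64 * sqrt(2); illustrative only.
    inp = layers.Input((h, w, n_channels))            # C input channels
    skips, x = [], inp
    for d in range(4):                                # encoder
        x = conv_block(x, base_filters * 2 ** d)
        skips.append(x)
        x = layers.MaxPooling2D(2)(x)
    x = conv_block(x, base_filters * 2 ** 4)          # bottleneck
    for d in reversed(range(4)):                      # decoder
        x = up_block(x, skips[d], base_filters * 2 ** d)
    out = layers.Conv2D(n_classes, 1, activation="softmax")(x)  # K classes
    return Model(inp, out)
```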

Fig. 1. Model overview. In the inference phase, the input volume (left) is sampled on 2D isotropic grids along multiple view axes. The model predicts a full volume along each axis and maps the predictions into the original image space. A fusion model combines the 6 proposed segmentation volumes into a single final segmentation.

The decisive feature of our multi-planar U-Net training (MPUnet) is the generation of the inputs at training and test time, which is done by sampling from multiple planes of random orientation spanning the image volume. That is, the network must learn to segment the input seen from different views, see Fig. 1.

The model \(f(x; \theta )\) takes as input multi-channel 2D image slices of size \(w\times h\), \(x \in \mathbb {R}^{w \times h \times C}\), and outputs a probabilistic segmentation map \(P \in \mathbb {R}^{w \times h \times K}\) for K classes. Prior to training we define a set \(V = \{ v_1, v_2, ..., v_i \}\) of i randomly sampled unit vectors in \(\mathbb {R}^3\). The set defines the axes through the image volume along which we sample 2D inputs to the model, visualized in Fig. 2. We re-sample the set V until all pairs of vectors have an angle of at least \(60^{\circ }\) between them. A sampled set of planar axes is shown in Fig. 2(a). Note that the model could also be fit using a set of fixed, predefined planes, but we found no performance gain in doing so, even if the fixed set included the standard planes. We use \(i=6\) for all reported evaluations. This number was chosen based on prior experiments in which we observed monotonically improving performance with the inclusion of additional planes, with \(i=6\) providing a good balance between accuracy and computation; see supplementary Table S.2.
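As an illustration, the rejection sampling of V could be implemented as follows; the function name and the exact acceptance criterion are our assumptions, not the reference implementation.

```python
# Sketch: draw i unit vectors uniformly on the sphere and re-draw the whole
# set until every pair of vectors is separated by at least 60 degrees.
import numpy as np

def sample_view_axes(i=6, min_angle_deg=60.0, rng=None):
    rng = np.random.default_rng() if rng is None else rng
    max_dot = np.cos(np.radians(min_angle_deg))        # cos(60 deg) = 0.5
    while True:
        v = rng.normal(size=(i, 3))
        v /= np.linalg.norm(v, axis=1, keepdims=True)  # project onto unit sphere
        dots = v @ v.T
        off_diag = dots[~np.eye(i, dtype=bool)]        # all distinct pairs
        if np.all(off_diag <= max_dot):                # every pair >= 60 degrees apart
            return v

V = sample_view_axes(i=6)                              # the set V used throughout
```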

During training, the model is provided batches of images randomly sampled from the i planes in V without supplying information about the corresponding axis. During inference, the model predicts along each plane, producing a set of i segmentation volumes \(\mathbf {P} = \{ P_{v} \in \mathbb {R}^{w \times h \times d \times K} \,\vert \,v \in V \}\). Each \(P_{v}\) is mapped to the input image space to obtain point correspondence by assigning to each voxel in the input image the value of its nearest predicted point in \(P_{v}\). Distances are computed in physical coordinates.
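A sketch of this mapping step is given below, assuming the physical coordinates of both the predicted grid points and the input voxels are available; a KD-tree is one straightforward way to find nearest predicted points, though the released code may proceed differently.

```python
# Sketch: map one per-view prediction P_v back onto the input image grid by
# assigning each input voxel the class probabilities of its nearest predicted
# point, with distances measured in physical (scanner) coordinates.
import numpy as np
from scipy.spatial import cKDTree

def map_to_image_space(pred_coords, pred_probs, voxel_coords, image_shape, n_classes):
    """pred_coords:  (N_pred, 3) physical coordinates of predicted grid points
       pred_probs:   (N_pred, K) softmax scores at those points
       voxel_coords: (N_vox, 3)  physical coordinates of the input image voxels
       Returns a probability volume of shape image_shape + (K,) aligned with the input."""
    tree = cKDTree(pred_coords)
    _, nearest = tree.query(voxel_coords, k=1)   # index of nearest predicted point
    probs = pred_probs[nearest]                  # (N_vox, K)
    return probs.reshape(*image_shape, n_classes)
```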

At test time, the learned invariance to orientation is exploited by segmenting the entire volume from each view. This results in several candidate segmentations for each subject, which are combined by a linear fusion model, see Fig. 1. We map \(\mathbf {P}\) to a single probabilistic segmentation by a weighted sum of the per-class and per-view softmax scores. For all \(w \cdot h \cdot d\) voxels x in \(\mathbf {P}\) and each class \(k \in \{ 1, ..., K\}\), the fusion model \(f_{\text {fusion}}: \mathbb {R}^{|V| \times K} \rightarrow \mathbb {R}^{K}\) calculates \(z(x)_k = \sum _{n=1}^{|V|} W_{n, k} \cdot p_{n, x, k} + \beta _k\). Here \(p_{n, x, k}\) denotes the probability of class k at voxel x as predicted by segmentation \(P_n\). The weight matrix \(W \in \mathbb {R}^{|V| \times K}\) weights the probabilities of each class as predicted from each view, and \(\beta \in \mathbb {R}^{K}\) are bias parameters, which can adjust the overall tendency to predict a given class. The parameters of \(f_{\text {fusion}}\) are learned from the validation data. The model scales the predictions according to which views do well on each class, motivated by the fact that different target classes may appear with different shapes and levels of recognizability when seen from the different directions in V.
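In code, the fusion step amounts to a small linear model over the stacked per-view probability volumes. The sketch below uses placeholder weights, whereas in practice W and \(\beta \) are fit on the validation set.

```python
# Sketch of the linear fusion: z(x)_k = sum_n W[n, k] * p[n, x, k] + beta_k,
# followed by an argmax over classes.
import numpy as np

def fuse_views(P, W, beta):
    """P:    (|V|, d, h, w, K) per-view probabilistic segmentations
       W:    (|V|, K) per-view, per-class weights
       beta: (K,)     per-class biases
       Returns the fused (d, h, w) label map."""
    z = np.einsum("nxyzk,nk->xyzk", P, W) + beta   # weighted sum over views
    return np.argmax(z, axis=-1)                   # final segmentation

# Tiny example with |V| = 6 views and K = 3 classes; uniform weights stand in
# for the learned parameters.
P = np.random.rand(6, 8, 8, 8, 3)
W = np.full((6, 3), 1.0 / 6)
beta = np.zeros(3)
seg = fuse_views(P, W, beta)
```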

Fig. 2. (a) Visualization of a set V of sampled view axis unit vectors. (b) Illustration of images sampled along one view. (c) Illustration of multiple images sampled along multiple unique views.

Isotropic Image Sampling. Interpolation is needed to sample image planes not aligned with the original voxel grid. We use tri-linear and nearest-neighbor interpolation to sample the image and label map, respectively. We take advantage of the necessity for interpolation by sampling images on isotropic grids in the physical scanner space, oriented according to the patient’s position in the scanner. This ensures that the model always operates on images in which the shapes of anatomical structures are maintained across scanners and acquisition protocols. Note that this approach may lead to over- or under-sampling along some axes, which may cause loss of image information or interpolation artefacts. Empirically, however, we found that the benefit of maintaining isotropy outweighed the potential drawbacks of interpolation.

We must define a set of parameters restricting the sampling. Specifically, we are free to choose (1) the pixel dimensions, \(q \in \mathbb {Z^+}\) (the number of pixels to sample for each image), (2) the real-space extent of the image (in mm), \(m \in \mathbb {R^+}\), and (3) the real-space distance between consecutive voxels, \(r \in \mathbb {R^+}\). Note that two of these parameters define the third. We use the same q, m and r for both image dimensions, producing square images. We sample images within a sphere of diameter m centered at the origin of the scanner coordinate system. We employ a simple heuristic that attempts to pick q, m and r so that (1) the training is computable on our GPUs with batch sizes of at least 8, (2) r approximately matches the resolution of the images along their highest resolution axis and (3) the sampled images span the entirety of the relevant volume of all images in the dataset. When this is not possible, the requirements are prioritized in the given order, with (1) having the highest priority. Note that (3) becomes less important with increasing numbers of planes, as voxels missed in one plane are likely to be included in some of the others.
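The following sketch illustrates how a single slice could be sampled on an isotropic grid for one view axis, using a voxel-to-scanner affine (e.g. as provided by nibabel) and tri-linear interpolation; all names and the exact grid placement are assumptions made for illustration only.

```python
# Sketch: sample a q-by-q image with pixel spacing r (physical extent m = q*r)
# in scanner coordinates, orthogonal to a view axis and offset along it.
# Tri-linear interpolation (order=1) is used for images; order=0 would be
# used for label maps.
import numpy as np
from scipy.ndimage import map_coordinates

def sample_plane(volume, affine, view_axis, offset, q=256, r=1.0, order=1):
    v = view_axis / np.linalg.norm(view_axis)
    # Build an orthonormal in-plane basis (u1, u2) perpendicular to v.
    helper = np.array([1.0, 0.0, 0.0]) if abs(v[0]) < 0.9 else np.array([0.0, 1.0, 0.0])
    u1 = np.cross(v, helper); u1 /= np.linalg.norm(u1)
    u2 = np.cross(v, u1)
    # q x q grid of physical coordinates, centred on the scanner origin and
    # shifted by `offset` (in mm) along the view axis.
    coords = (np.arange(q) - q / 2) * r
    gx, gy = np.meshgrid(coords, coords, indexing="ij")
    points = gx[..., None] * u1 + gy[..., None] * u2 + offset * v   # (q, q, 3)
    # Map physical coordinates to voxel indices and interpolate.
    inv = np.linalg.inv(affine)
    vox = points @ inv[:3, :3].T + inv[:3, 3]
    return map_coordinates(volume, vox.reshape(-1, 3).T, order=order,
                           mode="constant", cval=0.0).reshape(q, q)
```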

Augmentation. Processing the input image from different views has the same effect as applying affine transformations to the 3D input and presenting the transformed images to a (single-view) network. Thus, at its heart the MPUnet is a U-Net with extensive, systematic affine data augmentation. On top of the multi-view sampling, we also employ non-linear transformations to further augment the training data. We apply the Random Elastic Deformations algorithm [6] to each sampled image in a batch with a probability of 1/3. The elasticity constants \(\sigma \) and deformation intensity multipliers \(\alpha \) are sampled uniformly from [20, 30] and [100, 500], respectively. This generates augmented images with high variability in terms of both deformation strength and smoothness.
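A possible implementation of this augmentation step is sketched below, following the Random Elastic Deformations recipe [6] with the stated parameter ranges; the released code may differ in its details.

```python
# Sketch: per-pixel random displacements smoothed by a Gaussian of width
# sigma and scaled by alpha, with sigma ~ U[20, 30], alpha ~ U[100, 500];
# each sampled image is deformed with probability 1/3.
import numpy as np
from scipy.ndimage import gaussian_filter, map_coordinates

def maybe_elastic_deform(image, labels, rng, p=1.0 / 3.0):
    """image: (H, W, C) slice, labels: (H, W) integer label map."""
    if rng.random() >= p:
        return image, labels
    sigma = rng.uniform(20, 30)                      # smoothness of the field
    alpha = rng.uniform(100, 500)                    # deformation strength
    shape = image.shape[:2]
    dx = gaussian_filter(rng.uniform(-1, 1, shape), sigma) * alpha
    dy = gaussian_filter(rng.uniform(-1, 1, shape), sigma) * alpha
    x, y = np.meshgrid(np.arange(shape[0]), np.arange(shape[1]), indexing="ij")
    coords = np.array([x + dx, y + dy])
    warped_img = np.stack([map_coordinates(image[..., c], coords, order=1)
                           for c in range(image.shape[-1])], axis=-1)
    warped_lab = map_coordinates(labels, coords, order=0)   # keep labels discrete
    return warped_img, warped_lab
```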

The augmented images do not always display anatomically plausible structures. Yet, they often significantly improve generalization, especially when training on small datasets or tasks involving pathologies of highly variable shape. However, we weight the loss contribution from augmented images by 1/3 in order to optimize primarily over true images.
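One simple way to realize this weighting, shown here as a sketch rather than the authors' actual training loop, is to compute a per-sample loss and down-weight augmented samples by 1/3.

```python
# Sketch: per-sample weights of 1.0 for original images and 1/3 for augmented
# ones, applied to a sparse categorical cross-entropy loss.
import tensorflow as tf

def weighted_batch_loss(y_true, y_pred, is_augmented, aug_weight=1.0 / 3.0):
    """y_true: (B, H, W) integer labels, y_pred: (B, H, W, K) softmax scores,
       is_augmented: (B,) boolean mask marking augmented samples."""
    per_pixel = tf.keras.losses.sparse_categorical_crossentropy(y_true, y_pred)
    per_sample = tf.reduce_mean(per_pixel, axis=[1, 2])   # (B,)
    weights = tf.where(is_augmented, aug_weight, 1.0)
    return tf.reduce_mean(weights * per_sample)
```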

Pre- and Post-processing. Our model uses a minimum of image processing outside of the network itself. We refrain from applying any post-processing of the model’s output, because post-processing is typically highly task-specific. We only apply an image- and channel-wise outlier-robust pre-processing that scales intensity values according to the median and inter-quartile range computed over all non-background voxels. Background voxels are defined by having intensities less than or equal to the first percentile of the intensity distribution.
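This pre-processing step can be sketched as follows; the function name and exact percentile handling are illustrative assumptions.

```python
# Sketch: channel-wise robust intensity scaling. Background is taken as
# voxels at or below the 1st percentile; the remaining voxels are centred by
# their median and scaled by their inter-quartile range.
import numpy as np

def robust_scale(volume):
    """volume: (d, h, w, C) image; returns the scaled image."""
    out = np.empty_like(volume, dtype=np.float32)
    for c in range(volume.shape[-1]):
        channel = volume[..., c].astype(np.float32)
        background_cutoff = np.percentile(channel, 1)
        foreground = channel[channel > background_cutoff]   # non-background voxels
        median = np.median(foreground)
        iqr = np.percentile(foreground, 75) - np.percentile(foreground, 25)
        out[..., c] = (channel - median) / max(iqr, 1e-8)
    return out
```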

Implementation. The MPUnet is available as open source. The fully autonomous implementation makes the MPUnet applicable also to users with limited deep learning expertise and/or compute resources. A command line interface supports fixed-split or cross-validation training and evaluation on arbitrary images. Any non-constant hyperparameter can be inferred automatically from the training data. See the GitHub repository at https://github.com/perslev/MultiPlanarUNet for a user guide.

Table 1. Performance of the MPUnet across thirteen segmentation tasks. The shown F1 (dice) scores are mean values computed across all non-background per-class F1 scores. For the 10 MSD datasets, evaluation was performed by the challenge organizers on non-publicly available test-sets. For MICCAI and HarP, evaluation was performed over three trials. Five-fold cross-validation was used for OAI. The ‘Classes’ column includes the background class, which is not included when computing the F1 scores. The ‘Size’ column gives the total dataset size. Note that the F1 standard deviations for tasks 8, 9 & 10 are not yet published by the challenge organizers. We refer to http://medicaldecathlon.com/results.html for a detailed comparison of our results (team CerebriuDIKU) with those of other challenge participants.

3 Experiments and Results

We applied the MPUnet without task-specific modifications to a total of 13 segmentation tasks. Ten of those datasets were part of the 2018 MSD challenge and are described in detail on the challenge’s website. The remaining three datasets were the MICCAI 2012 Multi-Atlas Challenge (MICCAI) dataset [7], the EADC-ADNI Harmonized Hippocampal Protocol (HarP) dataset [8] and a knee MRI dataset from the Osteoarthritis Initiative (OAI) [9]. The evaluation covers healthy and pathological anatomical structures, mono- and multi-modal MR and CT, and various acquisition protocols. The mean per-class F1 (dice) scores of the MPUnet are reported in Table 1. Note that in MSD tumour segmentation tasks 3 & 7 both organ and tumour are segmented, and the mean F1 for those tasks is lifted by the performance on the organ and decreased by the performance on the tumour. We refer to the supplementary Table S.4 for detailed per-class scores for the ten MSD tasks.

The MPUnet reached state-of-the-art performance for DL methods on the three non-challenge datasets (MICCAI, HarP and OAI), despite comparable methods being developed and tuned specifically to those cohorts and tasks. On MICCAI, with a mean F1 of 0.74 the MPUnet is on par with the 0.74 obtained in [10] using a 2D multi-scale CNN on brain-extracted images and the 0.75 obtained in [11] using a combination of a multi-scale 2D CNN, a 3D patch-based CNN, a spatial information encoder network and a probabilistic atlas, also on brain-extracted images. With a mean F1 of 0.85 on HarP, the MPUnet compares favorably to the 0.78–0.83 (depending on subject disease state) reported in [12]. On OAI, with a mean F1 of 0.87, the MPUnet approaches the 0.88/0.89 (baseline/follow-up) obtained in [13] using a task-specific pipeline including 2D and 3D U-Nets along with multiple statistical shape model refinement steps. However, the comparison is not direct, as [13] worked on a smaller subset of the OAI data and predicted only 4 classes while we distinguished 7.

The MPUnet ranked 5th and 6th in the first and second phases of the Medical Segmentation Decathlon, respectively, in most cases comparing unfavorably only to significantly more compute-intensive systems (see below).

The question arises how the performance of a 2D U-Net with multi-planar augmentation compares to a U-Net with 3D convolutions. Such 3D models are computationally demanding and typically need – in our experience – large training datasets to achieve proper generalization. While we are not claiming that the MPUnet is universally superior to 3D models, we did find the MPUnet to outperform a 3D U-Net of comparable topology, learning and augmentation procedure across multiple tasks, including one for which the 3D model had sufficient spatial extent to operate on the entire input volume at once. We refer to the supplementary Table S.5 for details. We also found the MPUnet superior to both single 2D U-Nets trained on individual planes and ensembles of separate 2D U-Nets trained on different planes, see Tables S.2 & S.3 and Fig. S.1 in the supplementary material.

4 Discussion and Conclusions

The empirical evaluation over 13 segmentation tasks showed that multi-planar augmentation provides a simple mechanism for obtaining accurate segmentation models without hyperparameter tuning. With no task-specific modifications, the MPUnet performs well across many non-pathological tissues imaged with various MR and CT protocols, in spite of the target compartments varying drastically in number, physical size, shape and spatial distributions, as well as contrast to the surrounding tissues. The accuracies on the more difficult pathological targets also compare favorably to those of most other MSD contestants.

The MSD winning algorithm [14] relied on selecting a suitable model topology and/or cascade from an ensemble of candidates through cross-validation. In contrast to this and other top-ranking submissions, we aimed to develop a task-agnostic segmentation system based on a single architecture and learning procedure, which keeps the system lightweight and easily transferable to clinical settings with limited compute resources.

The fact that the MPUnet can be applied ‘as is’ across many tasks with high performance, as well as its robustness against overfitting, can be attributed to both the fully convolutional network approach, which is already known to generalize well, and our multi-planar augmentation framework. The latter allows us to apply a single 2D model with fixed hyperparameters, resulting in a fully autonomous segmentation system of low computational complexity. Multi-planar training improves generalization performance in several ways: (1) Sampling from multiple planes allows for a huge number of anatomically relevant images augmenting the training data; (2) Exposing a 2D model to multiple planes takes the 3D nature of the input into account while maintaining the statistical and computational efficiency of 2D kernels; (3) The systematic augmentation scheme allows test-time augmentation to be performed, which increases performance through variance reduction if errors across views are uncorrelated for a given subject (visualized in supplementary Fig. S.2). This makes the MPUnet an open-source alternative to 3D fully convolutional neural networks.