Abstract
Deep learning methods have gained increasing attention in addressing segmentation problems for medical image analysis despite the challenges inherited from the medical domain, such as limited data availability, lack of consistent textural or salient patterns, and high dimensionality of the data. In this paper, we introduce a novel multi-network architecture that exploits domain knowledge to address those challenges. The proposed architecture consists of multiple deep neural networks that are trained after co-aligning multiple anatomies through multi-metric deformable registration. This multi-network architecture can be trained with fewer examples and leads to better performance, robustness and generalization through consensus. Highly promising results, comparable to human accuracy, on the challenging task of interstitial lung disease segmentation demonstrate the potential of our approach.
M. Vakalopoulou and G. Chassagnon—Authors with equal contribution.
Keywords
- Multi-network Architecture
- Multiple Deep Neural Networks
- Exploit Domain Knowledge
- Fully Convolutional Network (FCNs)
- Perform Data Augmentation
These keywords were added by machine and not by the authors. This process is experimental and the keywords may be updated as the learning algorithm improves.
1 Introduction
Image segmentation is one of the most well-studied problems in medical image analysis [6, 8]. Segmentation seeks to group together voxels corresponding to the same organ, or to the same tissue type (healthy or pathological). Existing literature can be classified into two distinct categories: model-free and model-based methods. Model-based methods assume the manifold of the solution space can be expressed in the form of a prior distribution, with sub-space approaches (e.g. active shapes), probabilistic or graphical models and atlas-based approaches being some representatives of this category. Model-free approaches, on the other hand, rely purely on the observation space, combining image likelihoods with different classification techniques.
The emergence of deep learning as a disruptive method in the field of computer vision has significantly impacted the medical imaging community [13]. Numerous architectures have been proposed to address task-specific segmentation problems, the currently most successful technique being the Fully Convolutional Network (FCN) [11]. Additionally, FCNs have been combined with upsampling layers, creating a variety of networks [2, 9], and have been extended to 3D [7], further boosting the accuracy of semantic segmentation.
The main challenges for deep learning in medical imaging arise from the limited availability of training samples (which is amplified when targeting 3D architectures), the lack of discriminant visual properties and the three-dimensional nature of observations (high dimensional data). In this paper, we propose a novel multi-network architecture that copes with the above limitations. The central idea is to train multiple redundant networks fusing training samples mapped to various anatomical configurations. These configurations correspond to a representative set of anatomies and are used as reference spaces (frequently referred to as atlases). The mapping corresponds to a non-linear transformer. Elastic registration based on a robust, multi-metric, multi-modal graph-based framework is used within the non-linear transformer of the network. Training is performed on the sub-space and back-projected to the original space through a de-transformer that applies an inverse nonlinear mapping. The responses of the redundant networks are then combined to determine the optimal response to the problem.
The proposed framework also relates to the multitask learning paradigm (MTL), where disparate sources of experimental data across multiple targets are combined in order to increase predictive power. The idea behind this paradigm is that by sharing representations between related tasks, we can improve generalization. Even though an inductive bias is plausible in such paradigms, the implicit data augmentation helps reduce the effect of the data-dependent noise. The idea of MTL for image segmentation has been incorporated before, such as in deep networks [10] where soft or hard parameter sharing of hidden layers is performed, or in multi-atlas segmentation [6], where multiple pre-segmented atlases are utilized in order to better capture anatomical variation. As in most ensemble methods, the concept is that the combination of solutions by probabilistic inference procedures can offer superior segmentation accuracy.
AtlasNet differs from previous methods with respect to both scope and applicability. In (single or multi) atlas segmentation, the aim is to map a pre-segmented region of interest from a reference image to the test image; applicability is therefore limited to normal structures (e.g. organs of the body or healthy tissue) that exist in both images. Exploitability is further reduced in the case of multi-atlas segmentation due to the scarcity of multiple atlases. The proposed strategy, on the contrary, is also suitable for semantic labeling of voxels (as part of healthy or pathological tissue) without requiring spatial correspondence between those voxels in the atlas and the test image.
AtlasNet uses multiple forward non-linear transformers that map all training images to common subspaces to reduce biological variability and a backward de-transformer to relax the effect of possible artificial local deformations. In fact, due to the ill-posedness of inter-subject image registration, regularization constraints are applied to derive smooth solutions and maintain topological relationships among anatomical structures. Consequently, image registration does not always produce a perfectly conforming diffeomorphism due to the nonexistence of a single atlas that matches all anatomies. The use of multiple spaces reduces the atlas selection bias, while the backward transformation aims to balance the effect of possible alterations in local image texture due to the non-linearity in the transformation. Highly promising results comparable to human accuracy on the challenging task of interstitial lung disease (ILD) segmentation demonstrate the potential benefits of our approach. Furthermore, the obtained performance exceeded that of redundant conventional networks.
Finally, the proposed approach (Fig. 1) addresses most of the limitations of existing neural network approaches. First, it requires a fairly small number of training examples due to the reduced diversity of observations once mapped to a common anatomy. Second, it performs data augmentation in a natural manner thanks to the elastic mapping between observations and representative anatomies. Third, it inherits robustness, stability and better generalization properties for two reasons: the limited complexity of observations after mapping, and the “anatomically” consistent redundancy of the networks.
2 Methodology
The method consists of two main parts, a transformer and a de-transformer part. The former maps a sample S to N different atlases \(A_{i}, i \in \{1,\ldots ,N \}\), constructs their warped versions, and trains N different networks, while the latter projects the N predictions back to the initial space. These projections are then combined to obtain the final segmentation. The transformer part consists of a non-linear deformable operator (transformer \(T_i\)) and a segmentation network \(C_i\), while the de-transformer part uses the inverse deformable operator (de-transformer \(T^{-1}_i\)) to map everything back to the initial space of a sample S. The framework is flexible: any suitable transformation operator (with an existing inverse) can be coupled with a classifier.
2.1 Multimetric Deformable Operator
The multimetric deformable operator maps samples to different anatomies (atlases), thereby reducing variance and producing anatomically meaningful results. It is an elastic image registration method that follows a context-driven metric aggregation approach [4], which aims to find the optimal combination of different similarity metrics. The operator is implemented as a deformable mapping from a source image S to a given atlas \(A_i\). Suppose that a number of metric functions \(\rho _j\), \(j \in \{1,\ldots ,k \}\), can be used to compare the deformed source image and the target \(A_i\). The non-linear transformer T corresponds to the operator that optimizes in the domain \(\varOmega \) the following energy:
\[E(T) = \int _{\varOmega } \sum _{j=1}^{k} w_j \, \rho _j\big (S \circ T, A_i\big )\, d\varOmega + \int _{\varOmega } \psi (T)\, d\varOmega \]
where \(w_j\) are linear weights balancing the importance of the different metric functions, and \(\psi ()\) is a penalty function acting on the spatial derivatives of the transformation as regularization to impose smoothness. Such a formalism can be considered either in the continuous setting, which requires the metric functions \(\rho _j\) to be differentiable, or in a discrete setting. The advantage of a discrete variant is that it can integrate an arbitrary number and nature of metric functions as well as regularizers while offering good guarantees concerning the optimality properties of the obtained solution. Inspired by the work done in [5], we express the non-linear operator as a discrete optimization problem acting on a quantized version of the deformation space.
We used free-form deformations as the interpolation strategy, image metrics that are invariant to intensity, a pyramidal implementation for the optimization, and belief propagation for the estimation of the optimal displacement field in the discrete setting. Details on the implementation can be found in [5].
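To make the structure of such an energy concrete, the following is a minimal 1D sketch, not the registration method of [5]. It evaluates a weighted sum of two metric functions plus a smoothness penalty on the displacement field, and searches a small quantized label set exhaustively. The metrics (`ssd`, `one_minus_ncc`), the weights, the regularization weight `reg_weight`, and the use of exhaustive search over constant displacements in place of belief propagation are all assumptions made for illustration.

```python
import numpy as np

def ssd(a, b):
    """Mean squared intensity difference (lower is better)."""
    return float(np.mean((a - b) ** 2))

def one_minus_ncc(a, b):
    """One minus normalized cross correlation (lower is better)."""
    return float(1.0 - np.corrcoef(a, b)[0, 1])

def energy(source, target, disp, metrics, weights, reg_weight=1.0):
    """Weighted multi-metric data term plus a smoothness penalty on disp."""
    x = np.arange(source.size, dtype=float)
    warped = np.interp(x + disp, x, source)        # deformed source, S o T
    data = sum(w * rho(warped, target) for w, rho in zip(weights, metrics))
    smooth = float(np.sum(np.diff(disp) ** 2))     # psi: squared spatial derivative
    return data + reg_weight * smooth

# Toy signals: the bump in `target` sits 3 samples to the right of `source`.
x = np.arange(64, dtype=float)
source = np.exp(-((x - 20.0) ** 2) / 18.0)
target = np.exp(-((x - 23.0) ** 2) / 18.0)

# Quantized deformation space: constant displacements of -5..5 samples.
labels = np.arange(-5, 6)
energies = [
    energy(source, target, np.full(64, float(d)), [ssd, one_minus_ncc], [1.0, 0.5])
    for d in labels
]
best = int(labels[int(np.argmin(energies))])       # sampling source at x - 3 aligns the bumps
```

Because each metric is only evaluated, never differentiated, any number of metric functions can be added to the `metrics` list, which is the property the discrete setting is chosen for.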
2.2 Segmentation Networks
The segmentation networks \(C_i\) operate on the mapped image, \(T_i(S)\), to produce a segmentation map. They can be the same or different depending on the task and the application, and the framework is independent of the exact classifier. After defining the optimal deformations \(T_i, i = 1 \ldots N\), between the source image and the different atlases in the transformer part, AtlasNet uses the inverse transformations to project the predicted segmentation maps back to the initial space of the source image S: \(S^{seg}_i = T^{-1}_i ( C_i(T_i(S)))\).
In this work, motivated by the state-of-the-art performance of FCNs in several problems, we adapted them for dense labeling. We use the SegNet deep learning network [2], which performs pixelwise classification, is composed of an encoder and a decoder, and follows the example of U-net [9]. It consists of repeated blocks of convolutional, batch normalization, rectified linear unit (ReLU) and indexed max-pooling layers, similar to those of the VGG16 network. For more details we refer to the original publication.
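The composition \(S^{seg}_i = T^{-1}_i ( C_i(T_i(S)))\) can be sketched as a short loop over atlases. This is only an illustration of the data flow: the function name `atlasnet_predict` is hypothetical, the "deformations" are toy integer shifts standing in for the elastic registration, and the "network" is a simple threshold standing in for a trained classifier.

```python
import numpy as np

def atlasnet_predict(sample, transformers, inverse_transformers, networks):
    """Return the per-atlas predictions T_i^{-1}(C_i(T_i(S))) in sample space."""
    predictions = []
    for T, T_inv, C in zip(transformers, inverse_transformers, networks):
        warped = T(sample)                        # map S into the space of atlas i
        seg_in_atlas = C(warped)                  # segment in atlas space
        predictions.append(T_inv(seg_in_atlas))   # project back to the space of S
    return predictions

def make_shift(k):
    """Toy invertible mapping (assumption): circular shift by k columns."""
    return (lambda img: np.roll(img, k, axis=1)), (lambda img: np.roll(img, -k, axis=1))

def threshold_net(img):
    """Toy classifier (assumption): foreground where intensity exceeds 0.5."""
    return (img > 0.5).astype(np.float32)

sample = np.zeros((4, 6), dtype=np.float32)
sample[1:3, 2:4] = 1.0

pairs = [make_shift(k) for k in (1, 2, 3)]
preds = atlasnet_predict(
    sample,
    [p[0] for p in pairs],
    [p[1] for p in pairs],
    [threshold_net] * len(pairs),
)
```

With exactly invertible toy mappings, every projected prediction coincides with the segmentation of the original sample; with real elastic registration the projections differ slightly, which is what the subsequent fusion step exploits.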
Different fusion strategies can be used for the combination of the segmentations. We used the probabilistic output of the classifiers (before hard decision) and fused the output of the different networks based on majority voting.
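As an illustration of the fusion step, the sketch below combines per-atlas foreground probabilities (already projected back to the sample space) in two ways: a hard majority vote after thresholding, and averaging of the probabilities before thresholding. The text does not specify the exact voting rule, so both functions, their names and the threshold of 0.5 are assumptions.

```python
import numpy as np

def fuse_majority(prob_maps, threshold=0.5):
    """Threshold each map, then keep pixels selected by more than half the networks."""
    votes = (np.stack(prob_maps) >= threshold).sum(axis=0)
    return votes * 2 > len(prob_maps)

def fuse_mean(prob_maps, threshold=0.5):
    """Average the probabilities across networks, then threshold once."""
    return np.mean(np.stack(prob_maps), axis=0) >= threshold

# Three hypothetical per-atlas probability maps for the same three pixels.
p1 = np.array([0.90, 0.40, 0.45])
p2 = np.array([0.80, 0.60, 0.45])
p3 = np.array([0.30, 0.70, 0.95])

by_vote = fuse_majority([p1, p2, p3])
by_mean = fuse_mean([p1, p2, p3])
```

The two rules agree on the first two pixels and differ on the third, where a single very confident network outweighs two mildly negative ones under averaging but not under voting.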
3 Implementation Details
For the registration, we used the same parameters for all images and all atlases. Three different similarity metrics have been used, namely, mutual information, normalized cross correlation and discrete wavelet metric. For the mutual information 16 bins were used, in the range of \(-900\) to 100.
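For reference, mutual information with a fixed binning can be estimated from a joint histogram. The sketch below uses 16 bins over the range \(-900\) to 100 as stated above (presumably Hounsfield units, which is an assumption); the function name and the test images are hypothetical, and intensities outside the range are simply ignored by the histogram.

```python
import numpy as np

def mutual_information(a, b, bins=16, value_range=(-900.0, 100.0)):
    """Mutual information (in nats) from a joint histogram of two images."""
    hist, _, _ = np.histogram2d(
        a.ravel(), b.ravel(), bins=bins, range=[value_range, value_range]
    )
    pxy = hist / hist.sum()                    # joint distribution
    px = pxy.sum(axis=1, keepdims=True)        # marginal of a, shape (bins, 1)
    py = pxy.sum(axis=0, keepdims=True)        # marginal of b, shape (1, bins)
    nz = pxy > 0                               # skip empty cells to avoid log(0)
    return float(np.sum(pxy[nz] * np.log(pxy[nz] / (px @ py)[nz])))

rng = np.random.default_rng(0)
image_a = rng.uniform(-900.0, 100.0, size=(64, 64))
image_b = rng.uniform(-900.0, 100.0, size=(64, 64))   # independent of image_a

mi_self = mutual_information(image_a, image_a)    # close to log(16) for uniform data
mi_indep = mutual_information(image_a, image_b)   # close to zero
```

A registration that aligns two images well should push this value toward the self-information end of the scale, which is why it is used as a similarity metric.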
We used the same parameters for training all SegNet networks (initial learning rate = 0.01, decrease of learning rate = \(2.5\cdot 10^{-3}\) every 10 epochs, momentum = 0.9, weight decay = \(5\cdot 10^{-4}\)). The training of a single network required around 16 hours on a GeForce GTX 1080 GPU, while the prediction for a single (volumetric) subject lasted only a few seconds. For data augmentation we performed only random rotations (between \(-10\) and 10\(^\circ \)) and translations (between 0 and 20 pixels per axis) avoiding local deformations since the anatomy should not artificially change. Moreover, for training, we performed median frequency balancing [2] to balance the data, as the samples with disease are considerably fewer than the rest of the samples.
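Median frequency balancing assigns each class a loss weight equal to the median class frequency divided by that class's frequency, where the frequency of a class is its pixel count divided by the total number of pixels in the images in which it appears. A minimal sketch under these definitions follows; the function name and the toy label maps are hypothetical, and labels are assumed to be integers in the range 0 to `num_classes - 1`.

```python
import numpy as np

def median_frequency_weights(label_maps, num_classes):
    """Per-class weights: median class frequency divided by class frequency."""
    class_pixels = np.zeros(num_classes)   # pixels belonging to each class
    image_pixels = np.zeros(num_classes)   # pixels of the images containing the class
    for labels in label_maps:
        counts = np.bincount(labels.ravel(), minlength=num_classes)
        class_pixels += counts
        image_pixels[counts > 0] += labels.size
    freq = class_pixels / np.maximum(image_pixels, 1)
    seen = freq > 0
    weights = np.zeros(num_classes)
    weights[seen] = np.median(freq[seen]) / freq[seen]
    return weights

# Two toy 4x4 slices: class 1 (disease) covers 2 pixels of the first slice only.
slice_a = np.zeros((4, 4), dtype=np.int64)
slice_a[0, :2] = 1
slice_b = np.zeros((4, 4), dtype=np.int64)

weights = median_frequency_weights([slice_a, slice_b], num_classes=2)
```

Here the rare class receives a weight of 4.25 and the dominant class a weight below 1, so errors on diseased pixels contribute more to the training loss.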
4 Experimental Results and Dataset
We evaluated our method on ILD segmentation in CT images as a case study because it is a challenging problem: boundaries are hard to detect and delineation suffers from poor-to-moderate interobserver agreement [3]. Moreover, although several visual scoring systems have been proposed for the disease, they only allow basic quantification of ILD severity. The dataset includes 17 (volumetric) CT images consisting of 6000 slices in total, each of \(512 \times 512\) dimension, and annotations of lung and disease. The ILD annotation was performed by a medical expert by tracing the disease boundaries in axial view over all slices and was used for training the classification model. Assessment of the method was performed on images from 29 additional patients, fully annotated only on selected CT slices (\(n=20\)) by three different observers. Note that the data were multi-vendor (GE & Siemens) and correspond to the same moment of the respiratory cycle.
For all experiments we used 6 different atlases and registered both training and testing images to them. The choice of atlases was made by a radiologist towards integrating important variability of the considered anatomies. Our experimental evaluation has two objectives: (i) to show that AtlasNet provides more robust and accurate solutions compared to conventional networks and (ii) to examine whether the proposed methodology can truly be trained with fewer examples while leading to good performance. We used five metrics, namely sensitivity, precision, Hausdorff, average contour distance and dice coefficient (over the number of epochs), to evaluate the performance of the proposed method.
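The three overlap-based metrics can be computed directly from the counts of true positives, false positives and false negatives, as in the sketch below. The function name is hypothetical, and returning 1.0 when a denominator is empty is a convention assumed here, not one stated in the text. The two distance-based metrics (Hausdorff and average contour distance) are not covered.

```python
import numpy as np

def overlap_metrics(pred, truth):
    """Dice, sensitivity and precision of a binary prediction against ground truth."""
    pred = np.asarray(pred).astype(bool)
    truth = np.asarray(truth).astype(bool)
    tp = int(np.logical_and(pred, truth).sum())     # predicted and annotated
    fp = int(np.logical_and(pred, ~truth).sum())    # predicted only
    fn = int(np.logical_and(~pred, truth).sum())    # annotated only
    dice = 2 * tp / (2 * tp + fp + fn) if (tp + fp + fn) else 1.0
    sensitivity = tp / (tp + fn) if (tp + fn) else 1.0
    precision = tp / (tp + fp) if (tp + fp) else 1.0
    return {"dice": dice, "sensitivity": sensitivity, "precision": precision}

truth = np.array([1, 1, 1, 1, 0, 0])
pred = np.array([0, 1, 1, 1, 1, 0])
scores = overlap_metrics(pred, truth)   # tp = 3, fp = 1, fn = 1
```

In this toy case all three metrics equal 0.75, since one annotated pixel is missed and one unannotated pixel is predicted.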
On the Number of Atlases: Figure 2a presents the behavior of our method using different numbers of atlases. It can be observed that the dice initially increases and tends to stabilize for more than 5 templates. Note that, even with the use of only one atlas, the deformable operator of AtlasNet helps to increase the dice coefficient (from 0.533 to 0.604), as indicated by Fig. 2b, and achieves the highest values of dice compared to conventional networks and usual data augmentation techniques.
On the Number of Training Samples: To evaluate the performance of our architecture with fewer samples, we used a reduced number of samples (30%, 50% and 70% respectively) for the same number of epochs (18) and compared the performance with that of [2]. The mean dice coefficient values obtained with [2] were 0.434, 0.462 and 0.487, while those obtained with AtlasNet were 0.613, 0.646 and 0.672 respectively, indicating the robustness of AtlasNet with a significantly lower number of samples. In simple words, the proposed architecture produces better or similar results with 30% of the samples compared to the state-of-the-art architecture [2] with and without data augmentation.
Comparison with the State of the Art: Although results on different datasets are not directly comparable, we compare our method with works related to ILD segmentation. Anthimopoulos et al. [1] classified CT image patches with ILD patterns using a CNN and obtained 0.856 accuracy for 6 disease classes. By extracting patches on our data (where different patterns are annotated as a single class) in the same way as in [1] we obtained 0.916 accuracy. In [12] a patch-based CNN was augmented with a deep encoder-decoder to exploit partial annotations. By applying AtlasNet on the same dataset as in [12], we increased the mean dice from 0.671 to 0.725.
Moreover, we compared AtlasNet with respect to disease segmentation with standard frameworks (without registration and with or without data augmentation) for the same number of epochs (18) and illustrate results in Table 1 and Fig. 2b. For a fair comparison, we assessed accuracies using the same classification strategy [2] trained on the initial CT slices, and after performing data augmentation as described earlier. The proposed method reports the best accuracy with respect to Hausdorff distance, average contour distance and dice, indicating that the disease segmentation is much more accurate than that of the conventional frameworks with or without data augmentation. This can also be inferred from Fig. 3, where axial slices of two different subjects are depicted. It is clear that the proposed approach accurately segments the boundaries of the disease.
For a more complete evaluation, we compare AtlasNet also with inter-observer agreement using the annotations of three different medical experts. In particular, the annotations of one observer have been used as ground truth to evaluate the rest. From Table 1 and Fig. 3, it can be observed that AtlasNet demonstrates more robust performance than manual segmentation. Finally, it is worth mentioning that even if the network operates on 2D slices, without accounting for out-of-slice connections, the fusion of the different atlases’ predictions makes the final segmentation smooth across all three axes.
Concerning computational resources, we use a single segmentation network [2] for each of the N atlases, so the time and memory usage for one atlas is that of the CNN; we also showed that a small N (such as 6) suffices. For segmentation of one volumetric CT on a single GPU, the total testing time (using 6 atlases) is 3–4 min, including the registration step. The registration cost is negligible, since a graph-based GPU algorithm is used that takes 3–5 s per subject. This cost drops linearly with the number and computing power of GPUs. Thus, we believe that the additional complexity of AtlasNet is fully justified, since it improves performance by more than 20% and keeps it stable with only 30% of the training data compared to conventional single networks.
5 Conclusion
In this paper, we present a novel multi-network architecture for (healthy or pathological) tissue or organ segmentation that maximizes consistency by exploiting diversity. Evaluation of the method on interstitial lung disease segmentation highlighted its advantages over previous competing approaches as well as inter-observer agreement. The investigation of techniques for soft parameter sharing of hidden layers, and information transfer between the different networks and atlases is our direction for future work. Finally, the extension to multi-organ segmentation including multiple classes and loss functions is one of the potential directions of our method.
References
1. Anthimopoulos, M., Christodoulidis, S., Ebner, L., Christe, A., Mougiakakou, S.: Lung pattern classification for interstitial lung diseases using a deep convolutional neural network. IEEE Trans. Med. Imaging 35(5), 1207–1216 (2016)
2. Badrinarayanan, V., Kendall, A., Cipolla, R.: SegNet: a deep convolutional encoder-decoder architecture for image segmentation. IEEE PAMI (2017)
3. Camiciottoli, G., et al.: Lung CT densitometry in systemic sclerosis: correlation with lung function, exercise testing, and quality of life. Chest 131(3), 672–681 (2007)
4. Ferrante, E., Dokania, P.K., Marini, R., Paragios, N.: Deformable registration through learning of context-specific metric aggregation. In: Wang, Q., Shi, Y., Suk, H.-I., Suzuki, K. (eds.) MLMI 2017. LNCS, vol. 10541, pp. 256–265. Springer, Cham (2017). https://doi.org/10.1007/978-3-319-67389-9_30
5. Glocker, B., Komodakis, N., Tziritas, G., Navab, N., Paragios, N.: Dense image registration through MRFs and efficient linear programming. Med. Image Anal. 12(6), 731–741 (2008)
6. Iglesias, J.E., Sabuncu, M.R.: Multi-atlas segmentation of biomedical images: a survey. Med. Image Anal. 24(1), 205–219 (2015)
7. Milletari, F., Navab, N., Ahmadi, S.A.: V-net: fully convolutional neural networks for volumetric medical image segmentation. In: 2016 Fourth International Conference on 3D Vision (3DV), pp. 565–571. IEEE (2016)
8. Paragios, N., Ferrante, E., Glocker, B., Komodakis, N., Parisot, S., Zacharaki, E.I.: (Hyper)-graphical models in biomedical image analysis. Med. Image Anal. 33, 102–106 (2016)
9. Ronneberger, O., Fischer, P., Brox, T.: U-net: convolutional networks for biomedical image segmentation. In: Navab, N., Hornegger, J., Wells, W.M., Frangi, A.F. (eds.) MICCAI 2015. LNCS, vol. 9351, pp. 234–241. Springer, Cham (2015). https://doi.org/10.1007/978-3-319-24574-4_28
10. Ruder, S.: An overview of multi-task learning in deep neural networks. CoRR abs/1706.05098 (2017)
11. Shelhamer, E., Long, J., Darrell, T.: Fully convolutional networks for semantic segmentation. IEEE Trans. Pattern Anal. Mach. Intell. 39(4), 640–651 (2017)
12. Vakalopoulou, M., Chassagnon, G., Paragios, N., Revel, M., Zacharaki, E.: Deep patch-based priors under a fully convolutional encoder-decoder architecture for interstitial lung disease segmentation. In: 2018 IEEE International Symposium on Biomedical Imaging (ISBI) (2018)
13. Zhou, S., Greenspan, H., Shen, D.: Deep Learning for Medical Image Analysis. Academic Press, Cambridge (2017)
© 2018 Springer Nature Switzerland AG
About this paper
Cite this paper
Vakalopoulou, M. et al. (2018). AtlasNet: Multi-atlas Non-linear Deep Networks for Medical Image Segmentation. In: Frangi, A., Schnabel, J., Davatzikos, C., Alberola-López, C., Fichtinger, G. (eds) Medical Image Computing and Computer Assisted Intervention – MICCAI 2018. MICCAI 2018. Lecture Notes in Computer Science(), vol 11073. Springer, Cham. https://doi.org/10.1007/978-3-030-00937-3_75
DOI: https://doi.org/10.1007/978-3-030-00937-3_75
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-00936-6
Online ISBN: 978-3-030-00937-3
eBook Packages: Computer Science, Computer Science (R0)