Introduction

Body composition analysis aims to non-invasively categorize and quantify metabolically relevant tissues such as fat, muscle, and bone. Over the years, radiological imaging data have become a natural source for such analyses. More recently, convolutional neural networks have received widespread attention for automated image segmentation and tissue quantification due to the prospect of substantially higher efficiency than manual and semiautomated methods [1].

Obesity is characterized by an excess of ectopic abdominal or subcutaneous adipose tissue. It is strongly associated with a variety of other diseases such as diabetes, coronary heart disease, the metabolic syndrome, and many types of cancer [2–4]. Body composition and especially obesity are commonly characterized by measuring body mass index or bioelectrical impedance. These measures, however, may not resolve individual fat depots, which is crucial for proper phenotyping [2, 5]. Likewise, dual-energy X-ray absorptiometry (DEXA) is only approximate, operator-dependent, and requires good patient compliance. With their high anatomical resolution in three dimensions, tomographic imaging techniques have become a de facto standard for the quantification of body fat. Computed tomography (CT) and magnetic resonance imaging (MRI) can therefore help to identify specific phenotypes of obesity.

In contrast to subcutaneous adipose tissue (SAT), visceral adipose tissue (VAT) has a distinct metabolic role [6, 7] and is strongly associated with the metabolic syndrome [8]. Compared with CT, MRI provides superior soft-tissue contrast but is less widely available and more demanding in terms of time and effort. For prospective studies, MRI is preferred due to the lack of ionizing radiation and the overall lower risk profile. Fat on MRI is usually identified by its high signal intensity in T1-weighted images and quantified by a number of approaches; contrast agent is usually not required for body composition studies. Manual contouring [9] is often considered the reference standard but suffers from long processing times [10]. Semiautomated or supervised methods save time by automatically computing an approximate segmentation that is visually inspected and interactively adjusted in case of errors [10–12].

Fully automated methods are faster and eliminate any interreader variability [11–13] but may not necessarily provide the most accurate result [14]. They are typically based on geometrical modelling and image processing techniques such as thresholding (e.g., fixed signal intensities for tissue discrimination), morphological operators, and clustering (e.g., the separation of background and foreground), as sketched below. An artificial neural network, in contrast, is trained with reference annotations from expert readers and might therefore meet the requirements of both speed and accuracy.
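For illustration only, the following minimal sketch outlines such a classical pipeline in Python; the threshold value, the normalization assumption, and the morphological settings are hypothetical and not taken from any specific published method:

```python
# Minimal sketch of a classical (non-learning) fat segmentation pipeline.
# All parameter values are illustrative assumptions.
import numpy as np
from scipy import ndimage

def classical_fat_mask(image: np.ndarray, threshold: float = 0.6) -> np.ndarray:
    """Binarize a normalized T1-weighted slice at a fixed intensity threshold
    and clean the result with morphological operators."""
    mask = image >= threshold            # fixed-intensity thresholding
    mask = ndimage.binary_opening(mask)  # remove small speckles
    mask = ndimage.binary_closing(mask)  # close small holes
    return mask

def largest_foreground_cluster(mask: np.ndarray) -> np.ndarray:
    """Keep only the largest connected component (a simple
    foreground/background clustering step)."""
    labels, n = ndimage.label(mask)
    if n == 0:
        return mask
    sizes = ndimage.sum(mask, labels, index=range(1, n + 1))
    return labels == (int(np.argmax(sizes)) + 1)
```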

In recent years, deep learning techniques have shown promising results promoting automation and personalization in a wide array of medical applications [15–19]. For complex medical image processing and analysis tasks, including body composition assessment, convolutional neural networks have become a major method of choice [20, 21]. Tissue composition and distribution are analyzed via image segmentation techniques such as fully convolutional networks (FCNs) [22, 23]. Their hierarchical encoder-decoder structure identifies spatial associations in the input images and generates high-resolution segmentation masks. Accordingly, FCNs are also promising candidates for the quantification of abdominal fat tissue [24–26]. Variants of the UNet architecture have already proven highly suitable for medical image segmentation in general [15].

This work aimed to assess the performance of supervised deep learning-based methods for adipose tissue quantification from MR images, using three different FCNs based on the UNet architecture, and to compare the results with a semiautomated ground-truth approach. It is characterized by the combination of a relatively large number of MRI datasets, full abdominopelvic coverage, and a deliberate selection of patients with obesity.

Material and methods

Patients and MRI

Whole-abdominal MRI data were available from an IRB-approved single-center study at an academic research institution (Integrated Research and Treatment Center AdiposityDiseases, Leipzig University Medicine, Leipzig, Germany) investigating the long-term effects of strength versus endurance training on cardiometabolic risk factors in patients with obesity (BMI ≥ 35 kg/m²; trial registration number: NCT01435057). The dataset comprised 331 MRI examinations with a total of 12,422 abdominal MRI slices.

All patients had been examined in a 1.5-T MRI system (Achieva XR, Philips Healthcare) in supine position using the integrated whole-body coil for signal reception. The main pulse sequence for fat quantification was a dual-echo gradient echo with the following parameters: 50 transverse slices (two stacks covering the abdominopelvic region between the diaphragm and the pubic symphysis), slice thickness 10 mm, interslice gap 0.5 mm, echo times 2.3 ms (opposed phase) and 4.6 ms (in phase), repetition time 76 ms, flip angle 70°, field of view 530 mm × 530 mm, acquisition matrix 216 × 177, and reconstruction matrix 480 × 480.

Semiautomated segmentation

Reference segmentation was provided by two experienced readers who annotated all abdominal MRI slices with SAT and VAT amounts. A purely manual segmentation of all 12,422 MRI slices was not considered for this task because of the immense amount of time and effort. Instead, an in-house software framework, DicomFlex [16], was used for semiautomated segmentation of SAT and VAT amounts (area and volume) from the T1-weighted gradient-echo MR images using information from both in-phase and opposed-phase images. This method involved the computation and manual supervision of SAT and VAT regions of interest (ROI) as well as the supervised definition of a threshold on the histogram of MRI signal intensities separating fat and non-fat amounts within the VAT ROI [12]. Overall, this typically involved 30–40 individual slices per patient and required 15–20 min per examination (roughly 30 s per slice).
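As an illustration of the thresholding step, a minimal sketch is given below; the helper names are hypothetical, and the interactive ROI definition and reader supervision of the actual DicomFlex workflow are not reproduced. The slice geometry (10-mm thickness plus 0.5-mm gap) and pixel size (530-mm field of view over a 480 reconstruction matrix) follow the acquisition protocol described above:

```python
import numpy as np

# Hypothetical helpers illustrating the histogram-threshold step; the actual
# DicomFlex workflow is interactive and supervised by an expert reader.
def vat_fat_mask(slice_img: np.ndarray, vat_roi: np.ndarray,
                 threshold: float) -> np.ndarray:
    """Classify voxels inside the VAT ROI as fat if their signal intensity
    exceeds the reader-adjusted histogram threshold."""
    return vat_roi & (slice_img >= threshold)

def fat_volume_ml(fat_masks: list[np.ndarray],
                  pixel_mm: float = 530 / 480,
                  slice_spacing_mm: float = 10.0 + 0.5) -> float:
    """Sum fat areas over all slices and convert to a volume in milliliters,
    using the in-plane pixel size and the effective slice spacing."""
    voxel_mm3 = pixel_mm * pixel_mm * slice_spacing_mm
    return sum(int(m.sum()) for m in fat_masks) * voxel_mm3 / 1000.0
```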

This approach generated some regional misclassifications, for example, apparent visceral fat inside the kidneys or intestines, because tissue signal intensities overlapped slightly near the threshold. Such false visceral fat amounts are conspicuous in the images, but their contribution to the overall VAT amount is typically very small. An experienced reader carefully inspected these regions and dynamically adjusted the threshold to visually balance spurious against omitted visceral fat. In the following, the term fat quantification refers to methods involving image segmentation.

FCN architectures and training

Three FCN architectures for fully automated SAT and VAT segmentation were implemented: UNet [17], DenseUNet [18, 19], and CDFNet [20, 21]. All three process input images at a chosen resolution of 256 × 256 pixels and generate segmentation maps of the same dimension. The dataset was composed of T1-weighted (in-phase) MR images as model input and the corresponding ground-truth segmentation maps as targets. All three models were evaluated using a five-fold cross-validation scheme with training, validation, and test subsets. Each FCN was trained in a supervised manner on the training subsets; the evaluation was carried out on the test subsets, while the validation subsets were used to prevent overfitting. Using this cross-validation scheme, FCN performance could be assessed on the entire dataset. Further technical details and flow charts of the architectures are provided in the supplementary information.
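A minimal sketch of such a five-fold scheme is shown below; splitting at the patient level (so that slices of one patient never appear in different subsets) and the assignment of one fold to validation are plausible assumptions rather than documented details:

```python
# Sketch of a patient-level five-fold cross-validation split.
# The use of exactly one fold for validation is an assumption.
import numpy as np

def five_fold_splits(patient_ids, seed: int = 42):
    """Yield (train, val, test) patient ID lists for each of the five folds.
    Each fold serves once as test set, one other fold as validation set
    (overfitting control), and the remaining three folds as training set."""
    rng = np.random.default_rng(seed)
    ids = np.array(patient_ids)
    rng.shuffle(ids)
    folds = np.array_split(ids, 5)
    for k in range(5):
        val_idx = (k + 1) % 5
        test, val = folds[k], folds[val_idx]
        train = np.concatenate(
            [f for j, f in enumerate(folds) if j not in (k, val_idx)])
        yield train.tolist(), val.tolist(), test.tolist()
```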

Evaluation metrics

The pixelwise agreement between the adipose tissue segmentations of the FCN models and the ground truth was evaluated with the accuracy metric and the Dice similarity coefficient. The quantification of adipose tissue volumes was validated with a selection of aggregation metrics, each highlighting a different aspect of prediction performance. The Pearson correlation coefficient, a widely used measure of linear correlation, was used to verify a strong correlation between true and predicted fat volumes. The mean percentage error was used to estimate the systematic error (bias) of the predictor, and the standard deviation of the relative differences between true and predicted volumes provides the variational error of the predictions. Both error contributions may be combined into the root mean square percentage error (RMSPE). The similarity of the distributions of ground-truth and predicted adipose tissue volumes was estimated with the second Wasserstein distance; for better comparison with the error metrics, relative differences were also used here. Finally, the excess kurtosis was computed to assess the contribution of outliers (here, severely false predictions) to the aggregated metrics. The functional forms of all metrics may be found in the supplementary information.
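For orientation, a sketch of these metrics is given below; the exact functional forms used in this work are provided in the supplementary information, so details such as the normalization of the standard deviation are assumptions:

```python
# Sketch of the evaluation metrics described above; exact functional forms
# may differ slightly from the supplementary information.
import numpy as np
from scipy import stats

def dice(pred: np.ndarray, true: np.ndarray) -> float:
    """Dice similarity coefficient between two binary segmentation masks."""
    inter = np.logical_and(pred, true).sum()
    return 2.0 * inter / (pred.sum() + true.sum())

def wasserstein2(x: np.ndarray, y: np.ndarray) -> float:
    """Second Wasserstein distance between two equal-sized 1-D empirical
    distributions (computed from the sorted samples)."""
    return float(np.sqrt(np.mean((np.sort(x) - np.sort(y)) ** 2)))

def volume_metrics(v_pred: np.ndarray, v_true: np.ndarray) -> dict:
    """Aggregate quantification metrics over per-patient fat volumes."""
    rel = (v_pred - v_true) / v_true                  # relative differences
    return {
        "pearson_r": stats.pearsonr(v_pred, v_true)[0],
        "mpe_percent": 100 * rel.mean(),              # systematic error (bias)
        "sd_percent": 100 * rel.std(),                # variational error
        "rmspe_percent": 100 * np.sqrt(np.mean(rel ** 2)),
        "excess_kurtosis": stats.kurtosis(rel),       # outlier contribution
    }
```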

Results

Table 1 summarizes the agreement between predicted (\({\mathbf{P}}_{\mathbf{i}}\)) and ground-truth (\(\mathbf{T}\)) adipose tissue segmentation maps using the described cross-validation scheme. Each FCN architecture was trained with and without data augmentation. SAT classifications were generally more accurate than VAT predictions. The agreement with the ground truth was marginally higher when augmented data were included during training; this effect was more pronounced for VAT. Pixelwise similarity measures for DenseUNet and CDFNet were slightly higher than those of the vanilla UNet architecture. Figure 1 shows two sample MR images after CDFNet (\({\mathbf{P}}_{6}\)) segmentation in comparison with the ground truth.

Table 1 Average pixel-wise segmentation performance of the FCN models in cross-validation
Fig. 1

SAT and VAT segmentation for two sample MRI slices. The left column shows the ground truth \(\mathbf{T}\), the center column the \({\mathbf{P}}_{6}\) prediction (obtained with the CDFNet and augmented training data), and the right column displays VAT agreement between the two annotations. TP: true positive; TN: true negative; FP: false positive; FN: false negative

Table 2 provides an overview of the performance of the FCN models in adipose tissue quantification. Strong agreement was found between predicted and ground-truth SAT volumes, as reflected by a Pearson correlation of 0.999 and an RMSPE of 1–2%, consisting largely of variation error. In comparison, FCN predictions of VAT volumes showed correlation coefficients of up to 0.997 and a lowest RMSPE of 3.2%. Figure 2 shows the Bland–Altman plots of the \({\mathbf{P}}_{6}\) adipose tissue volume predictions. Bland–Altman plots for total adipose tissue volume (TAT) and the VAT/SAT volume ratio can be found in the supplementary material.

Table 2 Patient-wise fat quantification performance metrics of the FCN models \({{\varvec{P}}}_{{\varvec{i}}}\) in cross-validation
Fig. 2

Bland–Altman plots for SAT (top), VAT (middle), and TAT (bottom) volumes between \({\mathbf{P}}_{6}\) and \(\mathbf{T}\). The fat volumes \({\mathrm{V}}_{\mathrm{pred}}\) in L (liters) were obtained from the predicted SAT and VAT segmentations \({\mathbf{P}}_{6}\), while the volumes \({\mathrm{V}}_{\mathrm{true}}\) were derived from the ground-truth annotations \(\mathbf{T}\). The plots are annotated with the mean relative difference \(\mu\) (black line) and the 95% confidence range (light gray area)
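A minimal sketch of such a plot is given below; the definition of the relative difference (here taken against the mean of the two volumes) and all styling are assumptions:

```python
# Sketch of a Bland-Altman plot with a mean line and 95% range; the exact
# definition of the relative difference and the styling are assumptions.
import numpy as np
import matplotlib.pyplot as plt

def bland_altman(v_pred: np.ndarray, v_true: np.ndarray, label: str = "SAT"):
    mean_vol = (v_pred + v_true) / 2.0
    rel_diff = 100.0 * (v_pred - v_true) / mean_vol   # relative difference in %
    mu, sd = rel_diff.mean(), rel_diff.std()
    fig, ax = plt.subplots()
    ax.axhspan(mu - 1.96 * sd, mu + 1.96 * sd,
               color="lightgray", zorder=0)           # 95% confidence range
    ax.axhline(mu, color="black")                     # mean relative difference
    ax.scatter(mean_vol, rel_diff, s=12)
    ax.set_xlabel(f"mean {label} volume (L)")
    ax.set_ylabel("relative difference (%)")
    return fig
```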

Table 3 summarizes the variability between the best FCN predictions (\({\mathbf{P}}_{6}\)) and the ground truth for both SAT and VAT (cross-validation with 331 patients) as well as between two readers (SAT, selection of 29 patients). For reference purposes only, corresponding values from an independent previous assessment are also given. The average fat segmentation times for one MRI slice are reported in Table 4. With the FCN models, whole-abdominal adipose tissue quantification, assuming a few tens of MRI slices per patient, takes on the order of a few seconds on a PC with a powerful graphics card; on a standard office PC, processing may take 1–2 min. For comparison, semiautomated segmentation required 15–20 min per patient (roughly 30 s per slice).

Table 3 Variability of various methods for fat quantification
Table 4 Average compute times for one forward pass of the FCN during inference (segmentation of one MRI slice at 256 × 256 resolution) on CPU (Intel Xeon E5-2637 v2) or GPU (Nvidia GeForce GTX TITAN Black). For comparison, the semiautomated method used for the ground-truth annotations required roughly 30 s per slice

Discussion

Adipose tissue volume is a common biomarker for various clinical outcomes. This work evaluated to what extent fully convolutional networks may serve to automatically segment and quantify abdominal adipose tissue from MR images. Whole-abdominal image data of 331 study patients were annotated for SAT and VAT amounts using a semiautomated segmentation method. Three deep learning architectures, UNet and its derivatives DenseUNet and CDFNet, were trained on the original patient data as well as on additional samples generated by randomized image transformations (see the sketch below). All models were evaluated for segmentation and fat quantification performance in cross-validation.
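The transformation types and parameter ranges in the following augmentation sketch are illustrative assumptions, not the tuned settings used in training; only the principle of applying identical randomized transformations to image and label map is essential:

```python
# Sketch of randomized image transformations for data augmentation; the
# transformation types and ranges are assumptions, not the tuned settings.
import numpy as np
from scipy import ndimage

def augment(image: np.ndarray, mask: np.ndarray, rng: np.random.Generator):
    """Apply the same random flip and small rotation to an MRI slice and its
    segmentation mask (nearest-neighbor interpolation preserves labels)."""
    if rng.random() < 0.5:                        # random horizontal flip
        image, mask = image[:, ::-1].copy(), mask[:, ::-1].copy()
    angle = rng.uniform(-10.0, 10.0)              # small random rotation (deg)
    image = ndimage.rotate(image, angle, reshape=False, order=1)
    mask = ndimage.rotate(mask, angle, reshape=False, order=0)
    return image, mask
```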

Although variation error remains the largest source of deviation, all models tend to slightly underestimate VAT volume, likely due to class imbalance (i.e., VAT volumes are commonly smaller than those of SAT and the remaining tissue). For both types of adipose tissue, the low Wasserstein distance indicates preservation of the ground-truth volume distribution, and the moderate excess kurtosis suggests a relatively consistent quantification performance across samples.

Overall, data augmentation with tuned hyperparameters improved model performance. DenseUNet and CDFNet proved more parameter-efficient than the original UNet implementation, achieving similar or better results with fewer parameters, albeit at the cost of increased processing time. The highest evaluation measures for adipose tissue volume analysis were achieved by the CDFNet trained on the original and augmented data (\({\mathbf{P}}_{6}\)).

The FCN-based methods showed excellent SAT segmentation and quantification accuracy, and the corresponding agreement for VAT was very good. A comparison with the inter-rater agreement from user-guided quantification showed the superiority of the FCN models for SAT and similar or higher performance for VAT, compared with manual annotation alone or supervised image processing.

Agreement between methods was generally higher for SAT than for VAT, which is attributed to the high variability of VAT with respect to distribution, size, and shape. VAT needs to be carefully distinguished from surrounding structures and organs such as the bowel, pancreas, or urinary bladder [22–24]. VAT quantification is challenging even for human readers, as indicated by the lower inter-rater reliability [12].

The intraclass correlation and coefficient of variation of the FCN approach are also in line with the variability reported between two expert readers using an earlier MATLAB tool (a similar semiautomated approach) as well as a widely used commercial software package (manual segmentation) [12]. A more detailed comparison is not possible here because these parameters were obtained by different readers on different subjects.

With FCN processing times between a few seconds and about a minute depending on the hardware, whole-abdominal fat amounts (tens of slices) may be segmented practically on the fly [25]. Our FCN-based approach is thus at least one order of magnitude faster than the semiautomated method used here for the ground-truth \(\mathbf{T}\) segmentation (15–20 min per patient for 30–40 slices). These results demonstrate the promising reliability of FCN approaches for the quantification of adipose tissue compartments.

One of the first reports on automated segmentation of visceral and subcutaneous adipose tissue featured a very small sample size (around 40) and showed good agreement, with Dice coefficients between 0.82 and 0.92 [26]. In later work, a higher level of agreement (> 0.95) was reached with an improved neural network [27], although that study was limited by a small amount of training data and lacked external validation. Proper external validation was performed in a more recent work, but the sample size remained low with only 20 cases [28].

In general, the lack of sufficiently large datasets is still regarded as a major limitation of current approaches [29]. An open-source design might assist with the distribution and acceptance of the methods [30]. Modern implementations have reached a new level of processing speed with times on the order of 1 min [21]. Whole-body adipose tissue analyses are also offered as commercial services, but the associated fees might limit their use to smaller batches [31].

This work has a number of limitations. As mentioned in the methods section, the ground truth for adipose tissue volumes was not established by purely manual segmentation due to time constraints. We accepted the disadvantage of local misclassifications (see Fig. 1, bottom row), which were "learned" by the FCN, in exchange for a much higher number of patients with reference data.

On a technical level, our supervised learning methods focus on point estimation of adipose tissue volumes only. The presented FCNs have no means of reporting their reliability on out-of-distribution data and should therefore always be supervised by an expert reader.

Various methods have been proposed for uncertainty estimation for deep learning in general and body composition analysis in particular [28, 32], which should be considered for future research. Additionally, ground truth annotations should ideally be made by several independent observers to estimate the uncertainty of label generation as well.

The trained models should, in principle, extrapolate to similar cohorts and imaging systems without retraining. Even within a given cohort, however, data outliers may arise from different acquisition parameters, imaging artifacts, or rare phenotypes. One of the key challenges of deep-learning approaches is the availability of proper training data. Here, methods using generative adversarial learning schemes [33, 34] or spatially aware optimization strategies may be employed to improve generalization to unseen data. Future work needs to analyze the performance of the trained fat quantification models on external cohorts.

In conclusion, this work demonstrates that deep-learning approaches for adipose tissue quantification from MRI data are also feasible for patients with obesity. The resulting accuracy was equal to or better than that of operator-driven approaches, while processing required substantially less time and effort.