Automatic segmentation of the thalamus using a massively trained 3D convolutional neural network: higher sensitivity for the detection of reduced thalamus volume by improved inter-scanner stability

Objectives To develop an automatic method for accurate and robust thalamus segmentation in T1w-MRI for widespread clinical use without the need for strict harmonization of acquisition protocols and/or scanner-specific normal databases. Methods A three-dimensional convolutional neural network (3D-CNN) was trained on 1975 T1w volumes from 170 MRI scanners using thalamus masks generated with FSL-FIRST as ground truth. Accuracy was evaluated with 18 manually labeled expert masks. Intra- and inter-scanner test-retest stability were assessed with 477 T1w volumes of a single healthy subject scanned on 123 MRI scanners. The sensitivity of 3D-CNN-based volume estimates for the detection of thalamus atrophy was tested with 127 multiple sclerosis (MS) patients and a normal database comprising 4872 T1w volumes from 160 scanners. The 3D-CNN was compared with a publicly available 2D-CNN (FastSurfer) and FSL. Results The Dice similarity coefficient of the automatic thalamus segmentation with manual expert delineation was similar for all tested methods (3D-CNN and FastSurfer 0.86 ± 0.02, FSL 0.87 ± 0.02). The standard deviation of the single healthy subject’s thalamus volume estimates was lowest with 3D-CNN for repeat scans on the same MRI scanner (0.08 mL, FastSurfer 0.09 mL, FSL 0.15 mL) and for repeat scans on different scanners (0.28 mL, FastSurfer 0.62 mL, FSL 0.63 mL). The proportion of MS patients with significantly reduced thalamus volume was highest for 3D-CNN (24%, FastSurfer 16%, FSL 11%). Conclusion The novel 3D-CNN allows accurate thalamus segmentation, similar to state-of-the-art methods, with considerably improved robustness with respect to scanner-related variability of image characteristics. This might result in higher sensitivity for the detection of disease-related thalamus atrophy. Key Points • A three-dimensional convolutional neural network was trained for automatic segmentation of the thalamus with a heterogeneous sample of T1w-MRI from 1975 patients scanned on 170 different scanners. • The network provided high accuracy for thalamus segmentation with manual segmentation by experts as ground truth. • Inter-scanner variability of thalamus volume estimates across different MRI scanners was reduced by more than 50%, resulting in increased sensitivity for the detection of thalamus atrophy. Supplementary Information The online version contains supplementary material available at 10.1007/s00330-022-09170-y.


Introduction
There is growing interest in MRI-based volumetry of the thalamus, for example, in the management of patients with multiple sclerosis (MS). The thalamus is among the brain structures with the earliest signs of atrophy detectable in MRI in the course of MS [1], and MRI-based thalamus volume is a promising marker to predict the transition from clinically isolated syndrome to clinically definite MS [2]. Furthermore, thalamus atrophy is among the strongest predictors of cognitive impairment in MS [3] and therefore can serve as a surrogate outcome for cognition in MS therapy trials [4,5]. Manual delineation by an expert is the ground truth for MRI-based volumetry of the thalamus. Methods for automatic thalamus segmentation have been developed to facilitate thalamus volumetry [6][7][8][9], as manual segmentation by an expert is not compatible with the busy clinical routine at most sites. However, automatic brain volumetry methods are sensitive to the MRI scanner platform and details of the acquisition protocol [10][11][12][13][14]. This limits the sharing of normal databases and/or cutoffs between sites and/or scanners, which in turn detracts from the utility of automatic brain volumetry for widespread clinical use. Thus, there is a great need for automatic brain volumetry methods that are more robust with respect to scanner-related variability of image characteristics in MRI (e.g., gray-to-white matter contrast, signal-to-noise).
Convolutional neural networks (CNNs) outperform conventional machine learning approaches in many medical imaging tasks including MRI-based brain volumetry [15,16]. CNNs are also particularly promising in non-harmonized multi-site settings, since they can be made robust with respect to scanner-related variability by training with a heterogeneous dataset covering the whole range of image characteristics encountered in the considered multi-site setting [17].
Against this background, the present study trained a 3D-CNN for automatic thalamus segmentation in T1w-MRI on a large heterogeneous dataset. The 3D-CNN was compared with a publicly available 2D-CNN (FastSurfer [16]) and FSL [18] with respect to accuracy, intra-and inter-scanner test-retest stability, and sensitivity to detect thalamus atrophy.

Materials and methods
All datasets used in this study comprise 3D gradient echo T1w volume images of the brain acquired with scanner-specific acquisition protocols recommended by the manufacturer for MRI-based brain volumetry. Slice thickness ranged between 0.9 and 1.3 mm. TE time ranged between 1.7 and 5.1 ms (mean 3.1 ± 0.6 ms). TR time ranged between 4.7 and 25 ms (mean 10.1 ± 6.0 ms) for GE and Philips scanners, and between 4.5 and 3000 ms (mean 1837 ± 459 ms) for Siemens scanners.
A summary of the datasets is provided in Table 1.
Dataset for the training of the 3D-CNN

Normal database for single subject analysis
A sample of 4872 T1w-MRI from 4872 different patients acquired on 160 different MRI scanners for unspecific symptoms (headache, dizziness) was used as normal database for single subject analyses. None of the patients had a history of or currently ongoing neurological or psychiatric condition. All volume images were free of abnormalities beyond those expected for the patients' age based on visual inspection by the local radiologist.

Ethics approval and consent to participate
The MRI data of the training set and of the normal database had been transferred to jung diagnostics GmbH under the terms and conditions of the European General Data Protection Regulation for remote image analysis. Subsequently, the data had been anonymized. The need for written informed consent for the retrospective use of the anonymized data in the present study was waived by the ethics review board of the general medical council of the state of Hamburg, Germany.
Ethics approval for the retrospective use of the FTHP dataset was obtained from the same ethics review board. The single subject had given written informed consent.
The MS patient dataset comprised data from two prospective studies that had been approved by the local ethics committees. All patients had given written informed consent. This included the retrospective use of the data for the present study.

Thalamus segmentation with FSL
The FIRST module of the FSL Software (FSL; version 6.0.2; http://fsl.fmrib.ox.ac.uk/fsl) provides binary masks of deep gray matter structures including the thalamus [18].

3D-CNN for thalamus segmentation
A 3D-CNN-U-net architecture [19,20] recently introduced for the segmentation of white matter hyperintensities in T2w-MRI [15,21] was used (Fig. 1). The network was trained for the simultaneous segmentation of the right thalamus, left thalamus, remaining deep gray matter structures, and remaining total intracranial volume (TIV) (Fig. 1). The additional classes (beyond the left and right thalamus) were introduced to provide regional context which has been proven beneficial for other segmentation tasks [22].
The ground truth for the deep gray matter structures was obtained with FSL. The ground truth for the TIV was derived using a validated algorithm [23] based on the Statistical Parametric Mapping framework (version SPM12, https://www. fil.ion.ucl.ac.uk/spm/software/spm12/) [24]. The 3D-CNN was trained using massive data augmentation including simulation of random bias fields and statistical noise in addition to standard augmentation techniques (rotation, flipping). Manual quality control of ground truth segmentation was not performed.
A more detailed description of the 3D-CNN architecture, the training, and the data augmentation is given in the Supplementary Material. The trained 3D-CNN is available from the authors upon request under a non-disclosure agreement for non-commercial use.

Segmentation accuracy
The accuracy of bilateral thalamus segmentation by 3D-CNN, FastSurfer, and FSL in the IBSR dataset was characterized by the Dice similarity coefficient and the Hausdorff distance relative to the manual expert delineation. Analysis of variance for repeated measurements was used to compare the Dice coefficient and the Hausdorff distance between 3D-CNN, FastSurfer, and FSL. Bland-Altman plots were used to compare bilateral thalamus volume (THALV) estimates between the automatic methods and the manual ground truth.

Intra-and inter-scanner variability
Intra-and inter-scanner variability of THALV were assessed in the FTHP dataset. To measure intra-scanner variability, the residuals with respect to the mean THALV of all repeat scans with the same scanner (without repositioning) were computed for each THALV estimate. The intra-scanner residuals were pooled into a single vector (of length 477). To measure interscanner variability, first the mean values of THALV were computed for each scanner and pooled into a single vector (of length 123). Then, the mean overall mean values were subtracted element-wise from this vector, resulting in a vector of interscanner residuals with zero mean. Levene's test for equality of variances was used for pairwise comparison of the variance of intra-and inter-scanner residuals between 3D-CNN, FastSurfer, and FSL. Bonferroni correction was used to adjust the significance level for the number of pairwise tests; that is, p values smaller than 0.05/3 = 0.017 were considered statistically significant. The 50th and 95th percentiles of the absolute value of intra-and inter-scanner residuals were determined to further characterize intra-and inter-scanner variability.
Single subject analysis Regional brain volumes including THALV are strongly correlated with head size [26] and subject age [27]. The TIV estimated in T1w-MRI using a 3D-CNN specifically trained for this purpose was used as surrogate of head size. Removing between-subjects variability associated with TIV and age can improve the power of MRI-based brain volumetry to detect disease-related alterations. In the present study, the following procedure was used to account for the impact of TIV and age on THALV. First, the THALV-TIV-age relationship was Fig. 1 The proposed 3D-CNN U-net: A patch-wise approach with a patch size of 160 × 160 × 160 cubic voxels of 1-mm edge length was used. A fully convolutional encoder-decoder architecture with 3D convolutions, residual-block-connections, and four reductions of the feature map size was employed. The network was trained for simultaneous segmentation of the left and right thalamus, remaining deep gray matter structures, and remaining total intracranial volume modeled with THALV = a * TIV 2 + b * age 2 + c * TIV * age + d * TIV + e * age + f by minimizing the sum of squared differences in the normal database. Then, the residual of THALV with respect to this model, denoted resTHALV, was computed for each subject in the normal database, that is, resTHALV = THALV − (a * TIV 2 + b * age 2 + c * TIV * age + d * TIV + e * age + f) [28]. Since regression can be affected by outliers, a two-step approach was used. After the first regression, outliers were identified and removed from the second and final regression. Subjects with resTHALV < lower quartile − 1.5*inter-quartile range of resTHALV in the normal database or resTHALV > upper quartile + 1.5*inter-quartile range were considered outliers. The 95% confidence interval (95% CI) of resTHALV was computed as [−1.96*std, +1.96*std], where std is the standard deviation of resTHALV relative to the final THALV-TIV-age model in the normal database excluding outliers.
Modeling of the THALV-TIV-age relationship in the normal database was performed separately for THALV estimates from 3D-CNN, FastSurfer, and FSL. The same TIV estimate was used for the three thalamus segmentation methods.
For single subject analysis of THALV in MS patients, a twosample test approach was used, which is more conservative than a one-sample test approach [29]. More precisely, the residual resTHALV of the patient's THALV was computed with respect to the THALV-TIV-age model (obtained in the normal database) using the age and the individual TIV estimate for the patient. The 95% CI of the patient's resTHALV was approximated by the 95% CI of the inter-scanner variability in the FTHP. The patient's THALV was considered reduced if resTHALV including its 95% CI was below the 95% CI of resTHALV in the normal database ( Supplementary Fig. 1) [30]. The proportion of MS patients with reduced THALV was compared between 3D-CNN, FastSurfer, and FSL.

Results
Each of the two steps for the 3D-CNN training took approximately 2 days using a 2.4GHz CPU (Intel Xeon Silver 10-Core) with a GPU Quadro P5000 with 16GB memory. Computation time for thalamus segmentation was approximately 1 min with 3D-CNN and FastSurfer (4GB GPU memory needed), and approximately 5 min with FSL.
Visual inspection did not reveal clear failures of thalamus segmentation in any of the test cases (IBSR, FTHP, and MS dataset) with any of the automatic methods.

Intra-and inter-scanner variability
Results on inter-and intra-scanner variability are summarized in Table 3 and Fig. 5. The variance of the intra-scanner residuals of THALV in the FTHP dataset was significantly smaller for 3D-CNN (std = 0.08 mL, p < 0.0005) and FastSurfer (std = 0.09 mL, p = 0.012) compared to that in FSL (std = 0.15 mL). The reduction of the intra-scanner residuals with 3D-CNN compared to FastSurfer missed the Bonferroni-corrected significance threshold (p = 0.023 > 0.017). The variance of the interscanner residuals of THALV was significantly smaller for 3D-

Single subject analysis
Scatter plots of TIV-and age-corrected resTHALV versus age in the normal database are shown in Fig. 6. The standard deviation of resTHALV in the normal database was 0.88 mL, 1.09 mL, and 1.15 mL for 3D-CNN, FastSurfer, and FSL ( Table 4). The proportion of MS patients with reduced THALV was 24%, 16%, and 11% for 3D-CNN, FastSurfer, and FSL (Table 4; Fig. 6).

Discussion
Thalamus segmentation with the 3D-CNN was similarly accurate as with FastSurfer and FSL, but it was considerably more robust against repeat scanning, particularly against repeat scanning on different MRI scanners: inter-scanner variability of the estimated thalamus volume was lower by more than a factor two with the 3D-CNN (0.28 mL; Table 3) than with FastSurfer (0.62 mL) and FSL (0.63 mL). This might be explained by the large size and the heterogeneity of the training dataset comprising 1975 T1w-MRI of 1975 different patients from 170 different MRI scanners. For comparison, FastSurfer was trained on 160 T1w-MRI [16]. Furthermore, massive data augmentation was used for 3D-CNN training including adding random bias fields and noise in addition to standard augmentation. This might have forced the 3D-CNN to focus on the relevant features rather than tuning results too strongly to specific image characteristics of a given scanner or acquisition sequence.
Reduction of intra-and inter-scanner variability might be achieved not only by reduction of variance of no interest (associated with scanner-related variability of image characteristics) but also by reduction of variance of interest associated with actual between-subjects variability (a method that simply returns a thalamus volume of 15 mL for all subjects performs perfectly well  with respect to intra-and inter-scanner variability, but clearly is useless). The higher proportion of patients with significantly reduced thalamus volume in the MS dataset when using the 3D-CNN (24%) compared to FastSurfer (16%) and FSL (11%) suggests that this is not the case with the novel 3D-CNN. It rather suggests that the improved intra-and inter-scanner stability of the 3D-CNN resulted in increased sensitivity for the detection of thalamus atrophy in MS [31,32]. This is further supported by the fact that the disease severity in the MS patients was significantly correlated with the THALV estimates from the 3D-CNN and from FastSurfer but not with those from FSL (section "Impact of the segmentation method on the correlation between the thalamus volume and disease severity in MS" in the supplementary material). However, the correlations were weak (Pearson correlation coefficient ≤ 0.29) and therefore should be interpreted cautiously. The low strength of the observed correlations is in line with previous studies that reported either a significant weak correlation between EDSS and thalamus volume (e.g., Datta et al: r = −0.133, p < 0.001, 924 MS patients [31]) or lack of significant correlation (e.g., Tommasin et al: Spearman correlation coefficient = 0.07, p ≥ 0.01, 163 MS patients [33]). Intra-and inter-scanner variability of MRI-based regional brain volume estimates are often reported as absolute (in mL) or relative differences (in %) [12,13]. In the context of clinical applications of MRI-based brain volumetry, in which individual volume estimates are compared to a normal database, it might be more appropriate to specify intra-and inter-scanner variability in relation to the between-subjects variability in the normal database. Figure 6 demonstrates that the inter-scanner variability of the thalamus volume estimates in the FTHP dataset covers a significant fraction of the 95% CI of between-subjects variability in the normal database. This suggests that inter-scanner variability of thalamus volume estimates significantly contributes to the between-subjects variability in a normal database comprising scans from different scanners. Thus, reduction of inter-scanner variability results in relevant reduction of normal between-subjects variability. This is highly relevant, because it is expected to result in greater power for the detection of disease-related alterations. The ratio of inter-scanner variability (in the FTHP dataset) to between-subjects variability in the normal database was 0.28/ 0.88 = 0.32 for 3D-CNN, 0.62/1.09 = 0.57 for FastSurfer, and 0.63/1.15 = 0.55 for FSL (Tables 3 and 4). The larger fraction of MS patients with reduced thalamus volume according to the 3D-CNN estimates is in line with the lower inter-scannerto-between-subjects variability ratio for the 3D-CNN.
In clinical applications of MRI-based brain volumetry, reliable detection of disease-related alterations is more important than estimation of the actual volume of the brain region of interest with highest possible accuracy (but lower precision). However, the 3D-CNN provides also high accuracy for thalamus segmentation as indicated by the mean Dice coefficient of 0.86 relative to manual expert delineation. Sitter and coworkers reported Dice coefficients between repeat manual  thalamus segmentation by the same expert ranging from 0.87 to 0.91 [34]. The 3D-CNN resulted in some overestimation of the volume of small thalami compared to manual segmentation (Fig.  3A), similar to other automatic methods [10]. The reason is unclear. This requires further investigation, because it might limit the detection of mild thalamus atrophy in early disease stages by automatic thalamus volumetry.
Thalamus masks automatically generated by FSL were used as ground truth in the present study. The use of conventional automatic methods to generate ground truth for deep learning has been explored previously [35]. Other freely available software packages also provide segmentation of the thalamus (e.g., FreeSurfer [25], CAT12 [36], volBrain [37]) and, therefore, also could have been used for generation of the ground truth. The rationale for selecting FSL was that it provided the best agreement with manual expert delineation of the thalamus in a study by Sitter and co-workers [34]. Manual thalamus delineation by experts would have been preferred for the network training but was not feasible given the large size of the training set (n = 1975).
Considering the architecture of the 3D-CNN, a U-net structure and the full 3-dimensional T1w-MRI as input were chosen. U-net-based CNNs are the current state-of-the-art for segmentation tasks [38], also compared to other CNN architectures [39]. FastSurfer [16], a recent development based on QuickNAT [40], deploys a 2D-U-net that outperformed a 3D-U-net tested by the developers of FastSurfer. A possible explanation provided by the authors is that their 3D-U-net used small volume patches due to GPU memory constraints [16]. In the present study, the 3D-CNN used rather large patches covering approximately 2/3 of the brain (Fig. 1). Therefore, sufficient context information was contained in each patch. Furthermore, instance normalization was used in the present study instead of the standard batch normalization to cope with the limited batch size. We hypothesize that the 3D design allowed the 3D-CNN to become particularly robust with respect to camera-specific variability of image characteristics.
No manual correction of the automatic segmentation results was performed in this study. Clinical use of automatic thalamus volumetry should include visual quality control of the segmentation and manual correction if required.
In conclusion, the proposed 3D-CNN provides accurate thalamus segmentation that is particularly robust with respect to MRI scanner and protocol changes. This might improve the sensitivity to detect disease-related thalamus atrophy in multisite/multi-scanner settings without the need for scanner-specific normal databases, despite the fact that the 3D-CNN tended to slightly overestimate the volume of small thalami. Further improvement might be achieved by training with manual thalamus segmentation as ground truth. The 3D-CNN provides segmentation of the left and the right thalamus. Thalamus parcellation into nuclei and subnuclei would require further specific training.
Funding Open Access funding enabled and organized by Projekt DEAL.

Declarations
Guarantor The scientific guarantor of this publication is Roland Opfer.

Conflict of interest
The authors of this manuscript declare relationships with the following companies: Roland Opfer, Julia Krüger, Lothar Spies, and Ann-Christin Ostwaldt are employees of jung diagnostics GmbH, Hamburg, Germany. There is no actual or potential conflict of interest for the other authors.
Statistics and biometry One of the authors (RB) has significant statistical expertise.
Informed consent Written informed consent of the subjects included in the training set and in the normal database was waived by the ethics review board of the general medical council of the state of Hamburg, Germany.
The single subject of the Frequently Traveling Human Phantom dataset gave written informed consent.
All patients in the multiple sclerosis dataset gave written informed consent.
Ethical approval Ethics approval for the retrospective use of the Frequently Travelling Human Phantom dataset was obtained from the ethics review board of the general medical council of the state of Hamburg, Germany. The multiple sclerosis patient dataset comprised data from two prospective clinical studies that had been approved by the local ethics committees.

Methodology
• retrospective • cross-sectional study • performed at one institution Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/ .