Introduction

There is growing interest in MRI-based volumetry of the thalamus, for example, in the management of patients with multiple sclerosis (MS). The thalamus is among the brain structures with the earliest signs of atrophy detectable in MRI in the course of MS [1], and MRI-based thalamus volume is a promising marker to predict the transition from clinically isolated syndrome to clinically definite MS [2]. Furthermore, thalamus atrophy is among the strongest predictors of cognitive impairment in MS [3] and therefore can serve as a surrogate outcome for cognition in MS therapy trials [4, 5].

Manual delineation by an expert is the ground truth for MRI-based volumetry of the thalamus. Methods for automatic thalamus segmentation have been developed to facilitate thalamus volumetry [6,7,8,9], as manual segmentation by an expert is not compatible with the busy clinical routine at most sites. However, automatic brain volumetry methods are sensitive to the MRI scanner platform and details of the acquisition protocol [10,11,12,13,14]. This limits the sharing of normal databases and/or cutoffs between sites and/or scanners, which in turn detracts from the utility of automatic brain volumetry for widespread clinical use. Thus, there is a great need for automatic brain volumetry methods that are more robust with respect to scanner-related variability of image characteristics in MRI (e.g., gray-to-white matter contrast, signal-to-noise).

Convolutional neural networks (CNNs) outperform conventional machine learning approaches in many medical imaging tasks including MRI-based brain volumetry [15, 16]. CNNs are also particularly promising in non-harmonized multi-site settings, since they can be made robust with respect to scanner-related variability by training with a heterogeneous dataset covering the whole range of image characteristics encountered in the considered multi-site setting [17].

Against this background, the present study trained a 3D-CNN for automatic thalamus segmentation in T1w-MRI on a large heterogeneous dataset. The 3D-CNN was compared with a publicly available 2D-CNN (FastSurfer [16]) and FSL [18] with respect to accuracy, intra- and inter-scanner test-retest stability, and sensitivity to detect thalamus atrophy.

Materials and methods

All datasets used in this study comprise 3D gradient echo T1w volume images of the brain acquired with scanner-specific acquisition protocols recommended by the manufacturer for MRI-based brain volumetry. Slice thickness ranged between 0.9 and 1.3 mm. TE time ranged between 1.7 and 5.1 ms (mean 3.1 ± 0.6 ms). TR time ranged between 4.7 and 25 ms (mean 10.1 ± 6.0 ms) for GE and Philips scanners, and between 4.5 and 3000 ms (mean 1837 ± 459 ms) for Siemens scanners.

A summary of the datasets is provided in Table 1.

Table 1 Overview of the datasets

Dataset for the training of the 3D-CNN

The training dataset comprised 1975 clinical T1w volume images of 1975 different patients from 170 different MRI scanners. The volume images were randomly selected from a larger dataset to achieve a rather uniform distribution of patient age between 20 and 90 years. No additional eligibility criteria were applied to guarantee that the training set covered the whole range of T1w volume images encountered for MRI-based volumetry in clinical routine. In particular, there were no eligibility criteria with respect to the patients’ health status.

IBSR dataset for the assessment of segmentation accuracy

Eighteen T1w volume images together with masks of the subcortical regions including the thalamus manually delineated by experts freely available from the Internet Brain Segmentation Repository (IBSR, https://www.nitrc.org/projects/ibsr/) were used for the assessment of segmentation accuracy.

Frequently traveling human phantom (FTHP) dataset for the assessment of intra- and inter-scanner variability

T1w volume images of a single healthy middle-aged male subject who completed 123 imaging sessions on 123 different MRI scanners were used for the assessment of intra- and inter-scanner variability. Most imaging sessions comprised three to five repeat acquisitions (without repositioning) resulting in a total of 477 volume images of the same healthy subject within a period of about 2.5 years.

The FTHP dataset is freely available for research purposes (https://www.kaggle.com/datasets/ukeppendorf/frequently-traveling-human-phantom-fthp-dataset).

MS patient dataset to test sensitivity of single subject analysis

T1w volume images of 127 MS patients were included retrospectively to assess the sensitivity of the 3D-CNN to detect thalamus atrophy. Thirty-three of them had been enrolled in a clinical study at the Institute of Diagnostic and Interventional Neuroradiology of the University Hospital Carl Gustav Carus, Dresden, Germany (age 42.2 ± 10.1 years, Expanded Disability Status Scale 2.7 ± 1.6, disease duration 5.2 ± 4.8 years). The remaining 94 MS patients had participated in an observational study at the University Hospital of Zurich, Switzerland (age 37.3 ± 8.9 years, Expanded Disability Status Scale 1.3 ± 1.3, disease duration 2.7 ± 4.5 years).

Normal database for single subject analysis

A sample of 4872 T1w-MRI from 4872 different patients acquired on 160 different MRI scanners for unspecific symptoms (headache, dizziness) was used as normal database for single subject analyses. None of the patients had a history of or currently ongoing neurological or psychiatric condition. All volume images were free of abnormalities beyond those expected for the patients’ age based on visual inspection by the local radiologist.

Ethics approval and consent to participate

The MRI data of the training set and of the normal database had been transferred to jung diagnostics GmbH under the terms and conditions of the European General Data Protection Regulation for remote image analysis. Subsequently, the data had been anonymized. The need for written informed consent for the retrospective use of the anonymized data in the present study was waived by the ethics review board of the general medical council of the state of Hamburg, Germany.

Ethics approval for the retrospective use of the FTHP dataset was obtained from the same ethics review board. The single subject had given written informed consent.

The MS patient dataset comprised data from two prospective studies that had been approved by the local ethics committees. All patients had given written informed consent. This included the retrospective use of the data for the present study.

Thalamus segmentation with FSL

The FIRST module of the FSL Software (FSL; version 6.0.2; http://fsl.fmrib.ox.ac.uk/fsl) provides binary masks of deep gray matter structures including the thalamus [18].

3D-CNN for thalamus segmentation

A 3D-CNN-U-net architecture [19, 20] recently introduced for the segmentation of white matter hyperintensities in T2w-MRI [15, 21] was used (Fig. 1). The network was trained for the simultaneous segmentation of the right thalamus, left thalamus, remaining deep gray matter structures, and remaining total intracranial volume (TIV) (Fig. 1). The additional classes (beyond the left and right thalamus) were introduced to provide regional context which has been proven beneficial for other segmentation tasks [22].

Fig. 1
figure 1

The proposed 3D-CNN U-net: A patch-wise approach with a patch size of 160 × 160 × 160 cubic voxels of 1-mm edge length was used. A fully convolutional encoder-decoder architecture with 3D convolutions, residual-block-connections, and four reductions of the feature map size was employed. The network was trained for simultaneous segmentation of the left and right thalamus, remaining deep gray matter structures, and remaining total intracranial volume

The ground truth for the deep gray matter structures was obtained with FSL. The ground truth for the TIV was derived using a validated algorithm [23] based on the Statistical Parametric Mapping framework (version SPM12, https://www.fil.ion.ucl.ac.uk/spm/software/spm12/) [24]. The 3D-CNN was trained using massive data augmentation including simulation of random bias fields and statistical noise in addition to standard augmentation techniques (rotation, flipping). Manual quality control of ground truth segmentation was not performed.

A more detailed description of the 3D-CNN architecture, the training, and the data augmentation is given in the Supplementary Material.

The trained 3D-CNN is available from the authors upon request under a non-disclosure agreement for non-commercial use.

Thalamus segmentation with FastSurfer

The recently introduced FastSurfer pipeline [16] deploys a 2D-U-net-CNN trained on masks automatically generated with FreeSurfer [25]. FastSurfer is freely available (https://github.com/Deep-MI/FastSurfer). FastSurfer was used with default parameter settings and the pre-trained networks.

Segmentation accuracy

The accuracy of bilateral thalamus segmentation by 3D-CNN, FastSurfer, and FSL in the IBSR dataset was characterized by the Dice similarity coefficient and the Hausdorff distance relative to the manual expert delineation. Analysis of variance for repeated measurements was used to compare the Dice coefficient and the Hausdorff distance between 3D-CNN, FastSurfer, and FSL. Bland-Altman plots were used to compare bilateral thalamus volume (THALV) estimates between the automatic methods and the manual ground truth.

Intra- and inter-scanner variability

Intra- and inter-scanner variability of THALV were assessed in the FTHP dataset. To measure intra-scanner variability, the residuals with respect to the mean THALV of all repeat scans with the same scanner (without repositioning) were computed for each THALV estimate. The intra-scanner residuals were pooled into a single vector (of length 477). To measure inter-scanner variability, first the mean values of THALV were computed for each scanner and pooled into a single vector (of length 123). Then, the mean overall mean values were subtracted element-wise from this vector, resulting in a vector of inter-scanner residuals with zero mean. Levene’s test for equality of variances was used for pairwise comparison of the variance of intra- and inter-scanner residuals between 3D-CNN, FastSurfer, and FSL. Bonferroni correction was used to adjust the significance level for the number of pairwise tests; that is, p values smaller than 0.05/3 = 0.017 were considered statistically significant. The 50th and 95th percentiles of the absolute value of intra- and inter-scanner residuals were determined to further characterize intra- and inter-scanner variability.

Single subject analysis

Regional brain volumes including THALV are strongly correlated with head size [26] and subject age [27]. The TIV estimated in T1w-MRI using a 3D-CNN specifically trained for this purpose was used as surrogate of head size. Removing between-subjects variability associated with TIV and age can improve the power of MRI-based brain volumetry to detect disease-related alterations. In the present study, the following procedure was used to account for the impact of TIV and age on THALV. First, the THALV-TIV-age relationship was modeled with THALV = a ∗ TIV2 + b ∗ age2 + c ∗ TIV ∗ age + d ∗ TIV + e ∗ age + f by minimizing the sum of squared differences in the normal database. Then, the residual of THALV with respect to this model, denoted resTHALV, was computed for each subject in the normal database, that is, resTHALV = THALV − (a ∗ TIV2 + b ∗ age2 + c ∗ TIV ∗ age + d ∗ TIV + e ∗ age + f) [28]. Since regression can be affected by outliers, a two-step approach was used. After the first regression, outliers were identified and removed from the second and final regression. Subjects with resTHALV < lower quartile − 1.5*inter-quartile range of resTHALV in the normal database or resTHALV > upper quartile + 1.5*inter-quartile range were considered outliers.

The 95% confidence interval (95% CI) of resTHALV was computed as [−1.96*std, +1.96*std], where std is the standard deviation of resTHALV relative to the final THALV-TIV-age model in the normal database excluding outliers.

Modeling of the THALV-TIV-age relationship in the normal database was performed separately for THALV estimates from 3D-CNN, FastSurfer, and FSL. The same TIV estimate was used for the three thalamus segmentation methods.

For single subject analysis of THALV in MS patients, a two-sample test approach was used, which is more conservative than a one-sample test approach [29]. More precisely, the residual resTHALV of the patient’s THALV was computed with respect to the THALV-TIV-age model (obtained in the normal database) using the age and the individual TIV estimate for the patient. The 95% CI of the patient’s resTHALV was approximated by the 95% CI of the inter-scanner variability in the FTHP. The patient’s THALV was considered reduced if resTHALV including its 95% CI was below the 95% CI of resTHALV in the normal database (Supplementary Fig. 1) [30]. The proportion of MS patients with reduced THALV was compared between 3D-CNN, FastSurfer, and FSL.

Results

Each of the two steps for the 3D-CNN training took approximately 2 days using a 2.4GHz CPU (Intel Xeon Silver 10-Core) with a GPU Quadro P5000 with 16GB memory. Computation time for thalamus segmentation was approximately 1 min with 3D-CNN and FastSurfer (4GB GPU memory needed), and approximately 5 min with FSL.

Visual inspection did not reveal clear failures of thalamus segmentation in any of the test cases (IBSR, FTHP, and MS dataset) with any of the automatic methods.

Segmentation accuracy

The results on segmentation accuracy are summarized in Table 2 and Figs. 2 and 3. The Dice coefficients of the automatically generated thalamus masks relative to the manual expert delineation did not differ between 3D-CNN and FastSurfer; they were slightly higher for FSL (3D-CNN and FastSurfer 0.86 ± 0.02, FSL 0.87 ± 0.02; ANOVA with Greenhouse-Geisser non-sphericity correction p = 0.003; 3D-CNN or FastSurfer versus FSL p = 0.002, Cohen’s d = 0.50; 3D-CNN versus FastSurfer p = 0.274, d = 0.01; Fig. 2). The Hausdorff distance of the automatically generated thalamus masks relative to the manual ground truth masks did not differ between 3D-CNN and FSL; it was slightly higher for FastSurfer (4.51 ± 0.91 mm, 4.55 ± 0.90 mm, and 5.18 ± 1.03 mm; ANOVA with Greenhouse-Geisser non-sphericity correction p = 0.033; 3D-CNN versus FastSurfer p = 0.023, d = 0.68; 3D-CNN versus FSL p = 0.848, d = 0.04; FastSurfer versus FSL p = 0.054, d = 0.65; Fig. 2). All automatic methods overestimated THALV on average compared to the expert delineation (p < 0.0005; Fig. 3). This was mainly driven by overestimation of small THALV for 3D-CNN and FSL, and by overestimation of large THALV for FastSurfer (Fig. 3). Examples are shown in Fig. 4.

Table 2 Accuracy of automatic thalamus segmentation with 3D-CNN, FastSurfer, and FSL compared to the manual expert delineation in the IBSR dataset
Fig. 2
figure 2

Dice similarity coefficient (left) and Hausdorff distance (right) of thalamus segmentation with 3D-CNN, FastSurfer, and FSL relative to the manually derived expert masks in the IBSR dataset

Fig. 3
figure 3

Bland-Altman plots of automatic thalamus volume estimates from 3D-CNN (a), FastSurfer (b), and FSL (c) in comparison with the manually derived expert masks in the IBSR dataset

Fig. 4
figure 4

Manually delineated ground truth thalamus mask (left column) and the masks automatically generated with each of the three software tools for two cases from the ISBR dataset (first row: case 6, second row: case 18). Case 6 had the highest Hausdorff distance between the ground truth and the 3D-CNN (7.28mm), and case 18 the lowest (3.46 mm). True positive voxels of the automatic masks are shown in blue, false positive voxels in green, and false negative voxels in red. The examples illustrate the observed overestimation of the thalamus volume relative to the manual ground truth by all automatic methods. The examples also indicate that the automatic methods tend to generate smoother segmentation surfaces than manual segmentation

Intra- and inter-scanner variability

Results on inter- and intra-scanner variability are summarized in Table 3 and Fig. 5. The variance of the intra-scanner residuals of THALV in the FTHP dataset was significantly smaller for 3D-CNN (std = 0.08 mL, p < 0.0005) and FastSurfer (std = 0.09 mL, p = 0.012) compared to that in FSL (std = 0.15 mL). The reduction of the intra-scanner residuals with 3D-CNN compared to FastSurfer missed the Bonferroni-corrected significance threshold (p = 0.023 > 0.017). The variance of the inter-scanner residuals of THALV was significantly smaller for 3D-CNN (std = 0.28 mL) compared to that for FastSurfer (std = 0.62 mL, p < 0.0005) and FSL (std = 0.63 mL, p < 0.0005). The difference of the variance of the inter-scanner residuals between FastSurfer and FSL was not significant (p = 0.816).

Table 3 Intra- and inter-scanner variability of thalamus volume estimates in the FTHP dataset
Fig. 5
figure 5

Intra- (a) and inter-scanner (b) variability of THALV estimated with 3D-CNN, FastSurfer, and FSL in the FTHP dataset

Single subject analysis

Scatter plots of TIV- and age-corrected resTHALV versus age in the normal database are shown in Fig. 6. The standard deviation of resTHALV in the normal database was 0.88 mL, 1.09 mL, and 1.15 mL for 3D-CNN, FastSurfer, and FSL (Table 4).

Fig. 6
figure 6

The three plots show the resTHALV of the 4872 subjects in the normal database (blue dots) for 3D-CNN (a), FastSurfer (b), and FSL (c). The red dots indicate the MS patients. The black dots represent resTHALV for the frequently traveling human phantom (FTHP) dataset

Table 4 Between-subjects variability of thalamus volume estimates in the normal database and proportion of MS patients with significantly reduced thalamus volume

The proportion of MS patients with reduced THALV was 24%, 16%, and 11% for 3D-CNN, FastSurfer, and FSL (Table 4; Fig. 6).

Discussion

Thalamus segmentation with the 3D-CNN was similarly accurate as with FastSurfer and FSL, but it was considerably more robust against repeat scanning, particularly against repeat scanning on different MRI scanners: inter-scanner variability of the estimated thalamus volume was lower by more than a factor two with the 3D-CNN (0.28 mL; Table 3) than with FastSurfer (0.62 mL) and FSL (0.63 mL). This might be explained by the large size and the heterogeneity of the training dataset comprising 1975 T1w-MRI of 1975 different patients from 170 different MRI scanners. For comparison, FastSurfer was trained on 160 T1w-MRI [16]. Furthermore, massive data augmentation was used for 3D-CNN training including adding random bias fields and noise in addition to standard augmentation. This might have forced the 3D-CNN to focus on the relevant features rather than tuning results too strongly to specific image characteristics of a given scanner or acquisition sequence.

Reduction of intra- and inter-scanner variability might be achieved not only by reduction of variance of no interest (associated with scanner-related variability of image characteristics) but also by reduction of variance of interest associated with actual between-subjects variability (a method that simply returns a thalamus volume of 15 mL for all subjects performs perfectly well with respect to intra- and inter-scanner variability, but clearly is useless). The higher proportion of patients with significantly reduced thalamus volume in the MS dataset when using the 3D-CNN (24%) compared to FastSurfer (16%) and FSL (11%) suggests that this is not the case with the novel 3D-CNN. It rather suggests that the improved intra- and inter-scanner stability of the 3D-CNN resulted in increased sensitivity for the detection of thalamus atrophy in MS [31, 32]. This is further supported by the fact that the disease severity in the MS patients was significantly correlated with the THALV estimates from the 3D-CNN and from FastSurfer but not with those from FSL (section “Impact of the segmentation method on the correlation between the thalamus volume and disease severity in MS” in the supplementary material). However, the correlations were weak (Pearson correlation coefficient ≤ 0.29) and therefore should be interpreted cautiously. The low strength of the observed correlations is in line with previous studies that reported either a significant weak correlation between EDSS and thalamus volume (e.g., Datta et al: r = −0.133, p < 0.001, 924 MS patients [31]) or lack of significant correlation (e.g., Tommasin et al: Spearman correlation coefficient = 0.07, p ≥ 0.01, 163 MS patients [33]).

Intra- and inter-scanner variability of MRI-based regional brain volume estimates are often reported as absolute (in mL) or relative differences (in %) [12, 13]. In the context of clinical applications of MRI-based brain volumetry, in which individual volume estimates are compared to a normal database, it might be more appropriate to specify intra- and inter-scanner variability in relation to the between-subjects variability in the normal database. Figure 6 demonstrates that the inter-scanner variability of the thalamus volume estimates in the FTHP dataset covers a significant fraction of the 95% CI of between-subjects variability in the normal database. This suggests that inter-scanner variability of thalamus volume estimates significantly contributes to the between-subjects variability in a normal database comprising scans from different scanners. Thus, reduction of inter-scanner variability results in relevant reduction of normal between-subjects variability. This is highly relevant, because it is expected to result in greater power for the detection of disease-related alterations. The ratio of inter-scanner variability (in the FTHP dataset) to between-subjects variability in the normal database was 0.28/0.88 = 0.32 for 3D-CNN, 0.62/1.09 = 0.57 for FastSurfer, and 0.63/1.15 = 0.55 for FSL (Tables 3 and 4). The larger fraction of MS patients with reduced thalamus volume according to the 3D-CNN estimates is in line with the lower inter-scanner-to-between-subjects variability ratio for the 3D-CNN.

In clinical applications of MRI-based brain volumetry, reliable detection of disease-related alterations is more important than estimation of the actual volume of the brain region of interest with highest possible accuracy (but lower precision). However, the 3D-CNN provides also high accuracy for thalamus segmentation as indicated by the mean Dice coefficient of 0.86 relative to manual expert delineation. Sitter and co-workers reported Dice coefficients between repeat manual thalamus segmentation by the same expert ranging from 0.87 to 0.91 [34].

The 3D-CNN resulted in some overestimation of the volume of small thalami compared to manual segmentation (Fig. 3A), similar to other automatic methods [10]. The reason is unclear. This requires further investigation, because it might limit the detection of mild thalamus atrophy in early disease stages by automatic thalamus volumetry.

Thalamus masks automatically generated by FSL were used as ground truth in the present study. The use of conventional automatic methods to generate ground truth for deep learning has been explored previously [35]. Other freely available software packages also provide segmentation of the thalamus (e.g., FreeSurfer [25], CAT12 [36], volBrain [37]) and, therefore, also could have been used for generation of the ground truth. The rationale for selecting FSL was that it provided the best agreement with manual expert delineation of the thalamus in a study by Sitter and co-workers [34]. Manual thalamus delineation by experts would have been preferred for the network training but was not feasible given the large size of the training set (n = 1975).

Considering the architecture of the 3D-CNN, a U-net structure and the full 3-dimensional T1w-MRI as input were chosen. U-net-based CNNs are the current state-of-the-art for segmentation tasks [38], also compared to other CNN architectures [39]. FastSurfer [16], a recent development based on QuickNAT [40], deploys a 2D-U-net that outperformed a 3D-U-net tested by the developers of FastSurfer. A possible explanation provided by the authors is that their 3D-U-net used small volume patches due to GPU memory constraints [16]. In the present study, the 3D-CNN used rather large patches covering approximately 2/3 of the brain (Fig. 1). Therefore, sufficient context information was contained in each patch. Furthermore, instance normalization was used in the present study instead of the standard batch normalization to cope with the limited batch size. We hypothesize that the 3D design allowed the 3D-CNN to become particularly robust with respect to camera-specific variability of image characteristics.

No manual correction of the automatic segmentation results was performed in this study. Clinical use of automatic thalamus volumetry should include visual quality control of the segmentation and manual correction if required.

In conclusion, the proposed 3D-CNN provides accurate thalamus segmentation that is particularly robust with respect to MRI scanner and protocol changes. This might improve the sensitivity to detect disease-related thalamus atrophy in multi-site/multi-scanner settings without the need for scanner-specific normal databases, despite the fact that the 3D-CNN tended to slightly overestimate the volume of small thalami. Further improvement might be achieved by training with manual thalamus segmentation as ground truth. The 3D-CNN provides segmentation of the left and the right thalamus. Thalamus parcellation into nuclei and subnuclei would require further specific training.