Repeatability and reproducibility of FreeSurfer, FSL-SIENAX and SPM brain volumetric measurements and the effect of lesion filling in multiple sclerosis

Objectives To compare the cross-sectional robustness of commonly used volumetric software and effects of lesion filling in multiple sclerosis (MS).
Methods Nine MS patients (six females; age 38±13 years, disease duration 7.3±5.2 years) were scanned twice with repositioning on three MRI scanners (Siemens Aera 1.5T, Avanto 1.5T, Trio 3.0T) the same day. Volumetric T1-weighted images were processed with FreeSurfer, FSL-SIENAX, SPM and SPM-CAT before and after 3D FLAIR lesion filling with LST. The whole-brain, grey matter (GM) and white matter (WM) volumes were calculated with and without normalisation to the intracranial volume or FSL-SIENAX scaling factor. Robustness was assessed using the coefficient of variation (CoV).
Results Variability in volumetrics was lower within than between scanners (CoV 0.17–0.96% vs. 0.65–5.0%, p<0.001). All software provided similarly robust segmentations of the brain volume on the same scanner (CoV 0.17–0.28%, p=0.076). Normalisation improved inter-scanner reproducibility in FreeSurfer and SPM-based methods, but the FSL-SIENAX scaling factor did not improve robustness. Generally, SPM-based methods produced the most consistent volumetrics, while FreeSurfer was more robust for WM volumes on different scanners. FreeSurfer had more robust normalised brain and GM volumes on different scanners than FSL-SIENAX (p=0.004). MS lesion filling changed the output of FSL-SIENAX, SPM and SPM-CAT but not FreeSurfer.
Conclusions Consistent use of the same scanner is essential and normalisation to the intracranial volume is recommended for multiple scanners. Based on robustness, SPM-based methods are particularly suitable for cross-sectional volumetry. FreeSurfer poses a suitable alternative with WM segmentations less sensitive to MS lesions.
Key Points
• The same scanner should be used for brain volumetry. If different scanners are used, the intracranial volume normalisation improves the FreeSurfer and SPM robustness (but not the FSL scaling factor).
• FreeSurfer, FSL and SPM all provide robust measures of the whole brain volume on the same MRI scanner. SPM-based methods overall provide the most robust segmentations (except white matter segmentations on different scanners where FreeSurfer is more robust).
• MS lesion filling with Lesion Segmentation Toolbox changes the output of FSL-SIENAX and SPM. FreeSurfer output is not affected by MS lesion filling since it already takes white matter hypointensities into account and is therefore particularly suitable for MS brain volumetry.
Electronic supplementary material The online version of this article (10.1007/s00330-018-5710-x) contains supplementary material, which is available to authorized users.


Introduction
Multiple sclerosis (MS) is a common chronic neuroinflammatory and neurodegenerative disease [1]. Demyelinating lesions in the brain and spinal cord are the pathological hallmarks of MS, which are detectable in vivo with magnetic resonance imaging (MRI). MRI has therefore become an essential tool for the diagnosis and monitoring of disease activity in MS [1,2]. In MS, the lesion volume reflects the inflammatory burden while atrophy measures quantify neurodegenerative aspects of the disease, which play an important role in all disease stages [3]. Volumetry is therefore commonly used as a secondary endpoint in clinical trials [4]. Furthermore, volumetry can be helpful in improving our understanding of the disease since atrophy patterns have been shown to be different in MS compared to other demyelinating disorders [5].
Obtaining robust imaging biomarkers in MS for assessment of the inflammatory and neurodegenerative burden of disease is, however, challenging [3]. Brain volumetry is influenced by several subject-related factors such as hydration status, inflammation and clinical therapy [6]. MS lesions can specifically affect tissue segmentations since white matter (WM) lesions can be misclassified as grey matter (GM) or cerebrospinal fluid (CSF) [7,8]. Brain volumetry is also impacted by technical factors such as MRI field strength and scanner model, as well as post-processing related issues [8][9][10]. Understanding the effect and magnitude of technical factors is important when planning MRI studies [8].
There are several freely available tools for automated brain volumetry that are commonly applied in MS. Popular choices include FreeSurfer [11], Structural Image Evaluation with Normalisation of Atrophy Cross-sectional (SIENAX) [12] and Statistical Parametric Mapping (SPM) [13]. These software packages can automatically pre-process and segment T1-weighted images of the brain. FreeSurfer is computationally demanding and is based on a combined volumetric- and surface-based segmentation that aims to reduce partial volume effects from the convoluted shape of the cortical ribbon [11]. FreeSurfer uses a template-driven approach to provide a detailed parcellation and segmentation of the cortex and subcortical structures. SIENAX, part of the FMRIB Software Library (FSL), is computationally less demanding but only provides measurements of the gross tissue volumes (WM, GM and CSF) [12]. FSL-SIENAX relies on registration to the Montreal Neurological Institute 152 template for skull stripping and then performs intensity-based segmentation; the template registration step provides a scaling factor that can be used for normalisation. SPM is based on non-linear registration of the brain to a template and segments brain tissues by assigning tissue probabilities per voxel [13]. Computational Anatomy Toolbox (SPM-CAT) is an extension for SPM that uses a different segmentation approach based on spatial interpolation, denoising, additional affine registration steps, local intensity correction, adaptive segmentation and partial volume segmentation [14]. Like FSL-SIENAX, the SPM-based methods are less computationally demanding than FreeSurfer and only provide gross brain tissue volumes.
The primary purpose of this study was to compare the repeatability on the same scanner and the reproducibility on different scanners for brain tissue segmentations in FreeSurfer, FSL-SIENAX, SPM and SPM-CAT. A secondary aim was to study the effect of automated lesion filling to reduce MS lesion-related brain tissue segmentation bias.

Participants
Nine MS patients (six females, three males; mean age 38±13 years; mean disease duration 7.3±5.2 years), diagnosed according to the McDonald 2010 diagnostic criteria [15], were prospectively recruited from the outpatient clinic at the Department of Neurology, Karolinska University Hospital in Huddinge, Stockholm, Sweden, among consecutive patients referred for a clinical MRI. The participants were representative of the MS population in Sweden, with all subtypes represented in proportion to their frequency in clinical practice: six relapsing-remitting, two secondary progressive and one primary progressive [16]. Exclusion criteria were contraindications to MRI, neurological co-morbidities or a history of head trauma (none were excluded). The physical disability of the patients was assessed according to the Expanded Disability Status Scale [17] by an MS-experienced neurologist (K.F.). The median physical disability score was 2.0 (range 1.0-5.5). The study was approved by the local ethics committee and written informed consent was obtained from all participants.

MRI protocol
All participants were scanned twice on the same day on all three clinical MRI systems used in the study: Siemens Aera (1.5 T), Avanto (1.5 T) and Trio (3.0 T) (Siemens Healthcare, Erlangen, Germany). A 3D T1-weighted magnetisation-prepared rapid gradient-echo (MPRAGE) sequence was acquired twice with repositioning in between, resulting in a total of six T1-weighted volumes per participant. A representative example of the MPRAGE acquisitions is illustrated in Fig. 1. One 3D T2-weighted Fluid-Attenuated Inversion Recovery (FLAIR) volume was additionally acquired on each scanner for lesion segmentation. The MRI acquisition parameters are detailed in Table 1.

Image analysis
Each of the six 3D T1-weighted volumes from each participant was analysed cross-sectionally and processed in FreeSurfer, FSL-SIENAX, SPM and SPM-CAT. No additional preprocessing or manual intervention was performed to avoid introducing biases in the tissue segmentations. All input and output underwent visual quality assurance by an experienced rater (T.G.) and were found to be of satisfactory quality. Examples of the volumetric output are presented in Fig. 2.
FreeSurfer
FreeSurfer 6.0.0 (http://surfer.nmr.mgh.harvard.edu, Harvard University, Boston, MA, USA) was used to perform automatic processing as previously described [11,18]. FreeSurfer was run with the options '-mprage' and for the 3.0 T data also '-3T', as recommended by its developers. The variable 'Brain Segmentation Volume Without Ventricles from Surf' was used as the FreeSurfer estimation of the brain volume, which excludes the brainstem. The variable 'Total grey matter volume' was used as the estimation of the GM volume. The WM volume was assessed by summing the 'cerebral WM', 'cerebellar WM', 'brainstem' and 'corpus callosum' FreeSurfer variables. It is notable that FreeSurfer specifically segments white matter hypointensities. For normalisation purposes, the brain volume, GM volume and WM volume were divided by the 'Estimated Total Intracranial Volume'.
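The derivation of the FreeSurfer measures described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' pipeline: the dictionary keys are hypothetical names chosen to mirror the FreeSurfer variables quoted in the text, and the example values are made up rather than taken from the study.

```python
from typing import Dict

# Hypothetical keys mirroring the four FreeSurfer variables summed for WM.
WM_COMPONENTS = ("cerebral_wm", "cerebellar_wm", "brainstem", "corpus_callosum")


def freesurfer_volumes(stats: Dict[str, float]) -> Dict[str, float]:
    """Return brain, GM and WM volumes plus their eTIV-normalised fractions."""
    etiv = stats["etiv"]
    if etiv <= 0:
        raise ValueError("Estimated total intracranial volume must be positive")

    volumes = {
        # 'Brain Segmentation Volume Without Ventricles from Surf'
        "brain": stats["brain_seg_not_vent_surf"],
        # 'Total grey matter volume'
        "gm": stats["total_gm"],
        # Sum of cerebral WM, cerebellar WM, brainstem and corpus callosum
        "wm": sum(stats[name] for name in WM_COMPONENTS),
    }
    # Normalisation: each volume divided by the estimated intracranial volume.
    fractions = {f"{name}_fraction": vol / etiv for name, vol in volumes.items()}
    return {**volumes, **fractions}


# Illustrative values in ml (invented for this sketch, not study data).
example_stats = {
    "brain_seg_not_vent_surf": 1100.0,
    "total_gm": 620.0,
    "cerebral_wm": 450.0,
    "cerebellar_wm": 28.0,
    "brainstem": 22.0,
    "corpus_callosum": 3.0,
    "etiv": 1500.0,
}
fs_result = freesurfer_volumes(example_stats)
print(fs_result["wm"], round(fs_result["wm_fraction"], 4))
```

Because the brain volume variable excludes the brainstem while the WM sum includes it, the three measures in this sketch are not expected to add up exactly.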

FSL-SIENAX
The SIENAX method implemented in FSL 5.0 (https://fsl.fmrib.ox.ac.uk/fsl/fslwiki/SIENA, Oxford University, Oxford, UK) was used to obtain an automated quantification of the brain volume, GM volume and WM volume with automatic normalisation for head size with a subject-specific scaling factor, as previously described [19]. For this study, FSL-SIENAX was run with the optimised brain extraction parameters '-B -f 0.1', in accordance with previous recommendations for MS studies [20].
Fig. 1 [...] Table 1) with relapsing-remitting multiple sclerosis and an Expanded Disability Status Scale score of 2.0. For each of the three scanners (Siemens Aera 1.5 T, Siemens Avanto 1.5 T and Siemens Trio 3.0 T) two acquisitions were made with repositioning in between
SPM
Statistical Parametric Mapping, SPM12 (http://www.fil.ion.ucl.ac.uk/spm, University College London, London, UK), was used to automatically obtain the GM volume, WM volume and total intracranial CSF volume according to an adapted workflow as previously described [21]. The segment tool was run using the default settings. The brain volume in SPM was defined as the sum of the GM and WM volumes. For normalisation, the intracranial volume was used, which was calculated by summing the GM, WM and CSF volumes.
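As a worked illustration of these definitions, the following Python sketch computes the brain volume, the intracranial volume and the resulting tissue fractions in % from the three SPM tissue volumes. The function name and the input values are assumptions made for the example and do not come from the study.

```python
from typing import Dict


def spm_tissue_fractions(gm_ml: float, wm_ml: float, csf_ml: float) -> Dict[str, float]:
    """Brain volume = GM + WM; intracranial volume = GM + WM + CSF."""
    icv_ml = gm_ml + wm_ml + csf_ml
    if icv_ml <= 0:
        raise ValueError("Intracranial volume must be positive")
    brain_ml = gm_ml + wm_ml
    return {
        "brain_ml": brain_ml,
        "icv_ml": icv_ml,
        # Normalised measures expressed as unit-less tissue fractions in %.
        "brain_pct": 100.0 * brain_ml / icv_ml,
        "gm_pct": 100.0 * gm_ml / icv_ml,
        "wm_pct": 100.0 * wm_ml / icv_ml,
    }


# Invented example volumes in ml.
spm_result = spm_tissue_fractions(gm_ml=650.0, wm_ml=500.0, csf_ml=350.0)
print({name: round(value, 2) for name, value in spm_result.items()})
```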
SPM-CAT
The Computational Anatomy Toolbox (CAT) 12 is an extension to SPM12 (http://www.neuro.uni-jena.de/cat/index.html, Jena University Hospital, Jena, Germany) [14]. The cross-sectional data segmentation tool was run using the default settings. The brain volume in SPM-CAT was defined as the sum of the GM and WM volumes and the total intracranial volume was used for normalisation.
Lesion filling
Lesion filling was performed on all 3D FLAIR volumes in SPM12 using the lesion probability algorithm in Lesion Segmentation Toolbox 2.0.10 (LST, http://www.applied-statistics.de/lst.html, Technische Universität München, Munich, Germany) [22]. LST provides an automated probabilistic lesion segmentation, specifically developed for MS. It also provides automatic lesion filling without the need for parameter optimisation or binary thresholding of the lesion masks. The FLAIR lesion probability maps were used to perform lesion filling on the corresponding T1-weighted volumes from the same scanner [22]. Figure 3 illustrates the input and output of the lesion filling procedure.

Statistical analysis
SPSS Statistics 24.0 was used for the statistical analysis (IBM Corporation, Armonk, NY, USA). Due to the limited sample size, the data were treated as non-parametric. The robustness of repeated measures was assessed using the within-subject coefficient of variation (CoV). For intra-scanner repeatability, the measurements from the first and the second scan from the same scanner were used: $\mathrm{CoV}_{\text{intra-scanner}} = \mathrm{SD}/\mathrm{mean}$ of Scan 1 and Scan 2. For the inter-scanner reproducibility, the first scans from each of the three scanners were used: $\mathrm{CoV}_{\text{inter-scanner}} = \mathrm{SD}/\mathrm{mean}$ of Scan $1_{\text{Aera}}$, Scan $1_{\text{Avanto}}$ and Scan $1_{\text{Trio}}$. Paired comparisons were tested using the Wilcoxon signed ranks test with two-tailed exact significance. Group comparisons between the four software packages were tested using the Friedman test and, in case of significant differences among the software, post hoc paired analyses were performed with the Wilcoxon signed ranks test. Correction for multiple comparisons was performed using the Benjamini-Hochberg procedure separately for the intra-scanner CoVs, inter-scanner CoVs and for each Friedman test post hoc analysis [23]. A corrected p<0.05 was considered statistically significant. All reported p-values were significant after correction for multiple comparisons, unless otherwise specified.
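The CoV definitions and the Benjamini-Hochberg correction can be illustrated with a short standard-library Python sketch. The analysis in the study was run in SPSS, so this is only an approximation of the described steps. It assumes the sample standard deviation (n-1 denominator), which the text does not specify, and all function names and numbers below are hypothetical.

```python
from statistics import mean, median, stdev
from typing import List, Sequence


def cov_percent(values: Sequence[float]) -> float:
    """Within-subject CoV in %: SD / mean (sample SD, an assumption)."""
    return 100.0 * stdev(values) / mean(values)


def median_cov(per_subject: Sequence[Sequence[float]]) -> float:
    """Median CoV across subjects; each inner sequence holds one subject's repeats."""
    return median(cov_percent(values) for values in per_subject)


def benjamini_hochberg(p_values: Sequence[float]) -> List[float]:
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    n = len(p_values)
    order = sorted(range(n), key=lambda i: p_values[i])
    adjusted = [0.0] * n
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(n, 0, -1):
        index = order[rank - 1]
        running_min = min(running_min, p_values[index] * n / rank)
        adjusted[index] = running_min
    return adjusted


# Invented brain volumes (ml): intra-scanner uses Scan 1 and Scan 2 on one
# scanner, inter-scanner uses Scan 1 from Aera, Avanto and Trio.
intra = [[1000.0, 1002.0], [1150.0, 1147.0], [980.0, 981.0]]
inter = [[1000.0, 1012.0, 990.0], [1150.0, 1163.0, 1139.0], [980.0, 985.0, 970.0]]
print(round(median_cov(intra), 3), round(median_cov(inter), 3))
print(benjamini_hochberg([0.01, 0.04, 0.03, 0.20]))
```

With only two repeats per subject, the sample SD reduces to the absolute difference divided by the square root of two, so the intra-scanner CoV is directly proportional to the scan-rescan difference.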

Results
Comparability of the brain volumetry from different software
There were notable differences in the numeric brain tissue segmentation output from FreeSurfer, SIENAX, SPM and SPM-CAT, as detailed in Table 2. A full report of the volumetric output can be found in Online Supplementary Table 1.

Repeatability and reproducibility of non-normalised brain volumetry
Repeated measurements on the same scanner generally resulted in lower variability than measurements on the different scanners (median CoV 0.17-0.96% vs. 0.65-5.0%, p<0.001 by Wilcoxon signed ranks test), as further detailed in Table 3. Overall, the brain volume was the most robust tissue segmentation within scanners, with the lowest variability (median CoV 0.17-0.28%), and a comparable performance of all software (p=0.076).

Effects of normalisation on brain volumetry
Normalising the brain tissue volumes did not have a statistically significant positive effect on the intra-scanner repeatability, as further detailed in Table 3. On the contrary, for the FSL-SIENAX normalised brain volume there was a worsening of the intra-scanner repeatability after normalisation with the scaling factor. Normalisation to the FSL-SIENAX scaling factor did not significantly improve the inter-scanner reproducibility either. In contrast, normalisation to the intracranial volume often improved the reproducibility between scanners for FreeSurfer and the SPM methods. Specifically, significant improvements in the reproducibility were seen for the FreeSurfer normalised brain volume and normalised grey matter volume as well as for the normalised brain volume and white matter volume for both SPM-based methods. When normalising the tissues, FreeSurfer became more robust than FSL-SIENAX across scanners for both the normalised brain volume and normalised GM volume.

Effects of MS lesion filling
The median WM lesion volume was 1.8 ml (range 0.33-24 ml). There was no statistically significant effect of lesion filling on the FreeSurfer volumes, as detailed in Table 2. However, lesion filling caused changes in volumetrics from FSL-SIENAX, SPM and SPM-CAT. Most notably, highly significant changes were seen for all tissue compartments in SPM-CAT with increases in the estimations of the brain and WM volumes and decreases in the GM estimations, both for the non-normalised and normalised data. Lesion filling did not significantly affect the inter-scanner CoV for any of the software (data not shown).
Fig. 2 [...] The exemplified segmentations were based on the first scan on the Aera scanner for this participant, which was the scan with the lowest lesion volume (0.33 ml) in the study. Please note that FreeSurfer specifically segments white matter hypointensities (yellow), highlighted with orange arrows, and includes these in the brain volume, but not in the white matter volume. Meanwhile, FSL-SIENAX, SPM and SPM-CAT classify the white matter hypointensities as grey matter and/or cerebrospinal fluid (orange arrows). CAT Computational Anatomy Toolbox, FSL-SIENAX FMRIB Software Library Structural Image Evaluation with Normalisation of Atrophy Cross-sectional, SPM Statistical Parametric Mapping, T1WI T1-weighted imaging

Discussion
We present a prospective head-to-head comparison of the robustness of four of the most popular freely available brain segmentation tools in a representative real-life MS cohort scanned twice on three different scanners on the same day. New versions of the tested software have recently been released. An important contribution of the current study is therefore that we provide an up-to-date evaluation of the intra- and inter-scanner variability of brain tissue measurements in MS, facilitating an appropriate choice of software for volumetric studies. We found that the volumetric output differed between the software, which is expected since they have large technical differences [11][12][13]. Previous studies of earlier versions of the software have indeed also found significant differences in the output, both numerically and topographically [24][25][26]. While most previous studies have focused on differences and similarities in the segmentation results [24][25][26], the current study mainly focused on the robustness of the segmentation tools. Overall, we report that the variability in volumetrics was lower on the same scanner than between scanners, supporting recommendations to follow individuals on the same scanner [27,28]. Although brain atrophy rates can be double that of normal aging in untreated MS patients [29], treated MS patients have atrophy rates around 0.5%/year [30]. To accurately capture atrophy rates, it is therefore important to have a variability lower than that. Our reported CoVs for intra-scanner (0.17-0.92%) and inter-scanner (0.65-5.0%) variability suggest that measurements are feasible within 1-2 years for the most robust methods on the same scanner. In contrast, several years need to pass to be able to capture atrophy on different scanners, even with normalisation. SPM-based methods overall had the best repeatability and reproducibility of the four software (except WM segmentations where FreeSurfer was more robust) and are therefore particularly suitable for cross-sectional MS studies. This is in line with a previous international study of two MS patients scanned at multiple sites and a segmentation challenge in persons with diabetes mellitus and cardiovascular risk factors [31,32]. We also found that the whole-brain volume was the most robust volumetric, consistent with previous results [31,33]. This could be explained by lower variability with a large volume of interest and a larger contrast difference of CSF versus brain parenchyma compared to GM/WM segmentations. In studies with differences in the MRI protocols, it can therefore be recommended to primarily focus on the brain volume. Interestingly, there was no significant difference in the intra-scanner robustness of the software for the brain volume, meaning that all studied software can be favoured for cross-sectional MS studies of the brain volume.
Fig. 3 Illustration of the lesion segmentation and filling procedure in a 34-year-old male (referred to as MS5 in Online Supplementary Table 1) with relapsing-remitting multiple sclerosis and an Expanded Disability Status Scale score of 1.5. This representative scan from the Siemens Trio 3.0 T scanner provided the median lesion volume of the cohort (1.8 ml). The 3D T2-weighted FLAIR image (a) was used for lesion segmentation in Lesion Segmentation Toolbox, resulting in a probabilistic lesion mask (b, displayed as a heat map overlaid on a). The lesion mask was used to fill in lesions on the 3D T1-weighted image (c), providing the lesion-filled 3D T1-weighted image (d)
Table 2 (notes) All metrics given as median±interquartile range. Non-normalised (upper three rows) and FSL-SIENAX measurements are given in millilitres. Normalised measurements of FreeSurfer and SPM are given as unit-less tissue fractions in %. P-values represent the comparison of original and lesion-filled volumes by Wilcoxon signed ranks test (exact significance, two-tailed). CAT Computational Anatomy Toolbox, FSL-SIENAX FMRIB Software Library Structural Image Evaluation with Normalisation of Atrophy Cross-sectional, GM Grey matter, SPM Statistical Parametric Mapping, WM White matter. * Not statistically significant after correction for multiple comparisons
Table 3 Repeatability and reproducibility of the brain tissue volumes
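The relation between measurement variability and atrophy rate discussed above can be made concrete with a rough back-of-the-envelope calculation. The sketch below assumes a deliberately crude criterion, namely that atrophy becomes detectable once the expected cumulative volume loss exceeds the CoV. This heuristic is an assumption made for illustration and is not the calculation used by the authors; the function name is hypothetical.

```python
def years_until_detectable(cov_percent: float,
                           atrophy_percent_per_year: float = 0.5) -> float:
    """Years until cumulative atrophy equals the measurement CoV (crude heuristic)."""
    if atrophy_percent_per_year <= 0:
        raise ValueError("Atrophy rate must be positive")
    return cov_percent / atrophy_percent_per_year


# CoV ranges reported in the discussion: intra-scanner 0.17-0.92 %,
# inter-scanner 0.65-5.0 %, with an atrophy rate of about 0.5 %/year.
for label, cov in [("intra, best", 0.17), ("intra, worst", 0.92),
                   ("inter, best", 0.65), ("inter, worst", 5.0)]:
    print(label, round(years_until_detectable(cov), 2), "years")
```

Under this simplification, intra-scanner variability corresponds to follow-up intervals of under two years, whereas the upper end of the inter-scanner range corresponds to a decade, in keeping with the qualitative conclusion in the text.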
The current study focuses on some of the most commonly used freely available automated segmentation tools for brain volumetrics in MS, but there are several other segmentation tools available, such as AFNI and BrainSuite. While we provide information on the robustness of the studied software, the choice of software must also take other factors into account, such as which types of images are available, user skills and technical requirements [8]. In this study, we only provided the T1-weighted images for segmentation, which is the only image contrast that FSL-SIENAX and SPM-CAT are optimised for [12,14]. Previous results with segmentation based on multiple contrasts or multi-parametric maps have shown especially good robustness [32][33][34]. Evaluating such approaches is therefore an interesting avenue for future studies. From a technical standpoint, full functionality of SPM requires a MATLAB license [13], but a standalone version of SPM or FreeSurfer could be suitable alternatives since FreeSurfer was found to provide more robust normalised measurements between scanners than FSL-SIENAX, consistent with previous results [35]. While FreeSurfer is computationally more intense than the other software, it also provides more detailed regional morphometry.
Normalisation of the brain volumetrics to the intracranial volume generally improved the comparability of results between scanners, in line with previous recommendations [8]. This is likely due to a reduction of scaling effects between scanners [8]. However, using the scaling factor in FSL-SIENAX did not improve the robustness, suggesting that such normalisation may not be sufficient. Overall, there was also a lack of improvement in the repeatability within scanners for all three software with the normalisation. This finding likely reflects that normalisation procedures are less critical if measurements are produced on the same scanner. In clinical practice and longitudinal studies it is, however, important to consider that the variability in measurements is likely to be higher than that presented in this study, where all measurements were performed on the same day [31].
In terms of the effect of MS lesion filling, we found that lesion filling affected the volumetric results mainly for SPM and SPM-CAT, but also for FSL-SIENAX. These results are consistent with a previous MS study showing increased accuracy of SPM8 segmentations after lesion filling [36]. Of note, no effect was seen on the FreeSurfer volumes with lesion filling, likely because FreeSurfer specifically segments WM T1-hypointensities and thus takes these into account during the WM segmentations [11].
This study has some limitations. First, the sample size is small, but in total there were 54 measurements since each patient was scanned twice on three scanners and the study showed statistically significant differences in robustness of the software. Second, the MRI scanners were all from the same manufacturer, while higher inter-scanner variability would be expected with multiple vendors [31]. Third, although the results of the study could change by adjusting acquisition or processing parameters, these results reflect the standard procedures for MRI in MS at Karolinska University Hospital and we used recommended post-processing options [11,13,20]. There was a difference in the resolution between the FLAIR volumes, which could affect the lesion filling but this difference was consistent for the input of all software. Lastly, the current study focused solely on cross-sectional segmentation methods while the robustness of segmentations can be improved by including a priori knowledge of several time-points [19,35,37]. We therefore recommend future studies to also focus on comparing the robustness of longitudinal segmentation methods.
In conclusion, the results highlight the importance of consistently using the same scanner and normalising to the intracranial volume when multiple scanners are used. The outputs from FreeSurfer, FSL-SIENAX and SPM differ, but all three software provide cross-sectional brain volume segmentations with similar intra-scanner robustness. SPM-based methods overall produced the most consistent results, while FreeSurfer had less variability in WM volume segmentations across scanners and was less affected by WM lesions.