Technical challenges of quantitative chest MRI data analysis in a large cohort pediatric study

Objectives This study evaluated the effect of geometric distortion (GD) on MRI lung volume quantification and compared available manual, semi-automated, and fully automated methods for lung segmentation.
Methods A phantom was scanned with MRI and CT. GD was quantified as the difference in phantom volume between MRI and CT, with CT as the gold standard. Dice scores were used to measure overlap in shapes. Furthermore, 11 subjects from a prospective population-based cohort study each underwent four chest MRI acquisitions. The resulting 44 MRI scans with 2D and 3D Gradwarp were used to test five segmentation methods. Intraclass correlation coefficient (ICC), Bland–Altman plots, Wilcoxon, Mann–Whitney U, and paired t tests were used for statistics.
Results Using phantoms, volume differences between CT and MRI varied according to MRI position and 2D or 3D Gradwarp correction. With the phantom located at the isocenter, MRI overestimated the volume relative to CT by 5.56 ± 1.16% with the torso coil and 6.99 ± 0.22% with the body coil. Higher Dice scores and smaller intraobject differences were found for 3D Gradwarp MR images. In subjects, semi-automated and fully automated segmentation tools showed high agreement with manual segmentations (ICC = 0.971–0.993 for end-inspiratory scans; ICC = 0.992–0.995 for end-expiratory scans). Segmentation time per scan was approximately 3–4 h for manual segmentation and 2–3 min for fully automated methods.
Conclusions Volume overestimation of MRI due to GD can be quantified. Semi-automated and fully automated segmentation methods allow accurate, reproducible, and fast lung volume quantification. Chest MRI can be a valid radiation-free imaging modality for lung segmentation and volume quantification in large cohort studies.
Key Points
• Geometric distortion varies according to MRI setting and patient positioning.
• Automated segmentation methods allow fast and accurate lung volume quantification.
• MRI is a valid radiation-free alternative to CT for quantitative data analysis.
Electronic supplementary material The online version of this article (10.1007/s00330-018-5863-7) contains supplementary material, which is available to authorized users.


Introduction
Computed tomography (CT) is the most widely used technique for quantitative lung imaging because of its high spatial resolution and signal-to-noise ratio [1,2]. Consequently, quantitative imaging analysis using chest CT is better developed and validated than that using chest magnetic resonance imaging (MRI) [3][4][5][6]. Nevertheless, MRI is being developed as a feasible radiation-free alternative imaging modality [1,7]. However, several technical challenges hamper quantitative analysis with MRI, namely protocol standardization, low signal-to-noise ratio, and low spatial resolution. Volume computation in MRI is also limited by geometric distortion (GD) [1,3], which is mainly caused by magnetic field inhomogeneity and nonlinearity of gradient coils within the scanner [7,8]. Image processing techniques are available for GD correction and are commonly employed by manufacturers (e.g., Gradwarp, General Electric Healthcare). These techniques are particularly important for the delineation of target volumes for radiotherapy of cancer, where several models to correct GD have been analyzed [8][9][10][11][12][13].
Existing literature has focused on data with a relatively small field-of-view (FOV) or on anatomical locations close to the isocenter where GD is minimal, such as in MRI protocols for prostate, brain, and neck tumor size quantification [7,14,15]. Conversely, lung imaging requires a larger FOV and is therefore more influenced by GD, as magnetic field inhomogeneities make GD more pronounced the farther the scanned object is from the isocenter [7,10,16]. Consequently, peripheral lung portions (e.g., costophrenic angles) are most affected by GD. In addition, different MRI settings and patient positioning can influence the magnitude of GD [7,17].
To the best of our knowledge, no previous publications have assessed the effect of GD on lung volume quantification in chest MRI and specifically the validation of lung volume quantification in MRI against CT. This study addresses the problem of correcting for magnetic field inhomogeneity.
In addition, this study evaluates manual, semi-automated, and fully automated methods for lung segmentation and volume quantification with MRI. Lung segmentation is a fundamental step in image analysis and aims to extract quantitative information. Although manual segmentation with delineation of lung boundaries on each image can give accurate results, it is laborious. Therefore, various segmentation methods have been developed for CT images. Few studies have been conducted on segmentation methods for MR images [18,19], because MRI volumetry is believed to be more difficult and more variable than CT volumetry. In this study, we assessed segmentation methods for accuracy, reproducibility, and time efficiency to determine the best segmentation strategy for lung volume quantification using MRI in the Generation R Study, a large prospective population-based cohort study, described in detail in the Supplementary material.
In summary, we aimed (1) to quantify GD for different MRI scan settings on volume measurements compared to CT using phantoms and (2) to assess the accuracy, reproducibility, and time efficiency of semi-automated and fully automated lung volume segmentation tools compared to manual segmentations of MRI measurements obtained in children.

Phantom data
An MRI body phantom (Fig. E1), with four bottles filled with potassium sorbate (General Electric Healthcare), was used to assess the effect of GD on volume quantification under six different scan settings: (1) reference position with the phantom centered in the scanner isocenter, (2) electronic displacement of the FOV to simulate incorrect FOV positioning by an MRI technician, (3) manual displacement of the phantom to simulate possible patient movement in the scanner, (4) table repositioning to simulate a whole-body MRI protocol, (5) parallel imaging with different acceleration factors for faster image acquisition, and (6) use of a torso coil to replicate the lung MRI protocol of the prospective population-based Generation R cohort study. For settings (2) and (3), eight different phantom positions at 5 cm from the isocenter were tested: left (L), right (R), inferior (I), superior (S), left inferior (LI), right inferior (RI), left superior (LS), and right superior (RS). For setting (5), four different acceleration factors were tested (1, 2.25, 4, 5). For setting (6), three positions were tested: torso coil centered on the subject and torso coil placed 5 cm left or right of the center. All images, except those for setting (6), were acquired with the body coil to obtain the most homogeneous signal from the phantom and thus facilitate volume segmentation. Images were collected with in-plane two-dimensional (2D Gradwarp) and full three-dimensional (3D Gradwarp) GD correction. These correction techniques correct spatial distortion artifacts and blurring at the extreme margins of MR images caused by gradient field nonlinearity [20].

Subjects' data
To test lung volume segmentation methods, lung MRI data of a subset of 11 anonymized children were randomly selected from the Generation R Study [21,22]. After written informed consent (METC-2012-165), children underwent whole-body MRI, including brain, heart, hips, and lung MRI acquisitions. The MRI scans were carried out in a specially designed child-friendly MRI research facility. From November 2014 to January 2016, 5000 MRI scans were acquired in the Generation R Study. Each subject underwent two end-inspiratory and two end-expiratory spirometer-guided MRI acquisitions. Data were acquired with 2D and 3D Gradwarp. In particular, 3D Gradwarp of the scanner was applied to one end-inspiratory and one end-expiratory scan (Fig. 1).
More information about the Generation R Study and parameters for MRI and CT acquisitions are presented in the Supplementary material.

Phantom segmentation
Fig. 1 Flowchart of acquisition scheme per subject. Each subject (n = 11) underwent two end-inspiratory and two end-expiratory acquisitions. 2D and 3D Gradwarp correction were each applied to one end-inspiratory and one end-expiratory scan. In total, 11 subjects underwent four acquisitions, resulting in 44 scans
Phantom volume measurements were manually obtained with MRI and CT through signal intensity thresholding segmentation using AW Server 2 platform (AWS) by GEHC and 3D Slicer software (http://www.slicer.org) by a single observer. The signal intensity threshold was chosen specifically for each scan to include the entire volume of interest, which was visually inspected in multiplanar reformats. All 52 MRI phantom acquisitions were segmented, once with AWS and once with 3D Slicer, making a total of 104 segmentations. Three CT phantom segmentations were performed with 3D Slicer (Fig. 2) to obtain true volumes. A total of 20 out of 52 acquisitions were randomly selected for second segmentation with AWS and 3D Slicer to assess intra- and intermethod agreement.
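The threshold-based volume measurement described above can be illustrated with a minimal sketch. It assumes a plain global intensity threshold applied to a flat list of voxel intensities and a known voxel spacing; the function name `thresholded_volume_ml` and the example values are hypothetical and do not reproduce the interactive workflow of AWS or 3D Slicer.

```python
def thresholded_volume_ml(intensities, spacing_mm, lower, upper=float("inf")):
    """Volume (ml) of all voxels whose intensity lies in [lower, upper].

    intensities: flat iterable of voxel intensities (hypothetical input layout)
    spacing_mm:  (dx, dy, dz) voxel spacing in millimetres
    """
    voxel_volume_mm3 = spacing_mm[0] * spacing_mm[1] * spacing_mm[2]
    # Count the voxels that fall inside the chosen intensity window.
    n_selected = sum(1 for value in intensities if lower <= value <= upper)
    # 1 ml = 1000 mm^3
    return n_selected * voxel_volume_mm3 / 1000.0


# Hypothetical toy data: three of six voxels exceed the threshold of 100.
toy_volume = thresholded_volume_ml([0, 10, 200, 250, 300, 5], (1.0, 1.0, 2.0), lower=100)
```

In practice the threshold would be chosen per scan and the result checked visually, as described for the phantom segmentations.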
Five out of 11 Generation R subjects, with four acquisitions for each subject, were randomly selected for second segmentation with manual and semi-automated methods by the first observer and a second observer to assess intra- and interobserver agreement. Both observers were blinded to each other's segmentations.

Quantification of GD on phantom
MRI volume measurements were compared to CT measurements as the gold standard. The magnitude of GD was quantified as relative volume difference between MRI and CT measurements.
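As a minimal sketch of this quantity, the relative volume difference can be written as the MRI volume minus the CT volume, divided by the CT volume. The normalization by the CT (reference) volume and the sign convention (positive means MRI overestimates) are assumptions made for illustration; the function name and the example volumes are hypothetical.

```python
def relative_volume_difference_percent(mri_volume_ml, ct_volume_ml):
    """Relative MRI-vs-CT volume difference in percent, with CT as reference.

    Positive values mean that MRI overestimates the CT volume (assumed convention).
    """
    if ct_volume_ml <= 0:
        raise ValueError("reference CT volume must be positive")
    return 100.0 * (mri_volume_ml - ct_volume_ml) / ct_volume_ml


# Hypothetical example: a 1000 ml bottle on CT measured as 1070 ml on MRI.
example_gd = relative_volume_difference_percent(1070.0, 1000.0)
```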
Phantom volume segmentations of the aforementioned MRI scan settings were compared to reference CT images using the Dice score after rigid registration. The Dice score measures volumetric overlap, ranges from 0 (no overlap) to 1 (complete overlap) [28], and can be seen as a measure of shape similarity.
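For illustration, the Dice score of two binary segmentations A and B is 2|A ∩ B| / (|A| + |B|). The sketch below assumes that both masks are already registered to the same grid and flattened to equally long sequences of 0/1 values; the function name and the toy masks are hypothetical.

```python
def dice_score(mask_a, mask_b):
    """Dice overlap of two equally sized, flattened binary masks."""
    if len(mask_a) != len(mask_b):
        raise ValueError("masks must have the same number of voxels")
    size_a = sum(1 for value in mask_a if value)
    size_b = sum(1 for value in mask_b if value)
    overlap = sum(1 for a, b in zip(mask_a, mask_b) if a and b)
    if size_a + size_b == 0:
        # Convention assumed here: two empty masks overlap completely.
        return 1.0
    return 2.0 * overlap / (size_a + size_b)


# Hypothetical toy masks: 3 voxels each, 2 of them shared.
toy_dice = dice_score([1, 1, 1, 0, 0], [0, 1, 1, 1, 0])
```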
To assess intraobserver agreement, 20 randomly selected phantom acquisitions were segmented twice by the first observer.

Quantification of GD on patient's data
Volume differences between 2D and 3D Gradwarp datasets were computed as the volume difference between the two end-inspiratory scans. Inspiratory scans were used because the volume of interest is easier to segment owing to the more homogeneous signal intensity of the lung parenchyma. As the entire Generation R cohort was only acquired with 2D Gradwarp, 3D Gradwarp correction performed with the built-in software of the scanner (3D GW scanner) was compared with an offline software (3D GW off-line). Further details are provided in the online supplement.

Comparison of lung segmentation tools on subjects' data
Lung segmentations from end-inspiratory scans were obtained using 3D Slicer (with threshold painting tool) [24], GeoS [25], Ivanovska [26,29], and Pennati [27] methods and compared to manual segmentation (MS) to assess their performance. At the time of analysis, the tested fully automated methods were not able to perform segmentation of end-expiratory scans due to decreased contrast differences between lung parenchyma and surrounding tissues. Consequently, only 3D Slicer and GeoS methods were compared to MS for end-expiratory scans. Mean segmentation time for end-inspiratory and end-expiratory images was calculated for each method.
Vital capacity (VC) computed with MRI (VC MRI) by segmentation was compared to VC measured by spirometry (VC SPIROMETER) to assess correlation. VC MRI was computed as the difference between inspiratory and expiratory lung volumes: the smallest measured end-expiratory lung volume was subtracted from the largest measured end-inspiratory lung volume. VC MRI was compared to the highest VC SPIROMETER.
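The VC MRI computation can be sketched as follows, assuming the segmented lung volumes of one subject are available in millilitres; the function name and the example volumes are hypothetical.

```python
def vc_mri_ml(inspiratory_volumes_ml, expiratory_volumes_ml):
    """VC MRI: largest end-inspiratory minus smallest end-expiratory lung volume."""
    if not inspiratory_volumes_ml or not expiratory_volumes_ml:
        raise ValueError("at least one inspiratory and one expiratory volume is required")
    return max(inspiratory_volumes_ml) - min(expiratory_volumes_ml)


# Hypothetical subject with two end-inspiratory and two end-expiratory scans.
example_vc = vc_mri_ml([2450.0, 2510.0], [1120.0, 1085.0])
```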

Statistical analysis
Descriptive data were reported as means ± standard deviations. Q-Q plots and Shapiro-Wilk tests were used to test normality. Intraclass correlation coefficient (ICC) and Bland-Altman plots were used to assess intra- and interobserver agreement. Paired samples t test, Wilcoxon signed-ranks test, or Mann-Whitney U test was applied to assess differences in lung volume measurements. To compare semi-automated and fully automated segmentation methods with MS, volume differences were calculated as absolute and relative difference. Dice scores were used to measure overlap in shapes. Pearson correlation coefficient was used to determine correlation between VC MRI and VC SPIROMETER. P values ≤ 0.05 were considered to be statistically significant. Multiple comparisons were adjusted using Bonferroni correction. Statistical analyses were performed using SPSS v.21 (IBM SPSS Statistics).
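The quantities behind a Bland-Altman plot can be sketched as below. The sketch assumes the conventional definition (mean of the paired differences as bias, bias ± 1.96 standard deviations as limits of agreement); the analyses in this study were run in SPSS, so the function name and the example volumes are purely illustrative.

```python
from statistics import mean, stdev


def bland_altman(method_a, method_b):
    """Return (bias, lower limit, upper limit) for paired measurements."""
    if len(method_a) != len(method_b) or len(method_a) < 2:
        raise ValueError("need at least two paired measurements")
    differences = [a - b for a, b in zip(method_a, method_b)]
    bias = mean(differences)
    # Sample standard deviation of the differences (n - 1 in the denominator).
    half_width = 1.96 * stdev(differences)
    return bias, bias - half_width, bias + half_width


# Hypothetical paired volumes (ml) from two segmentation methods.
bias, lower_limit, upper_limit = bland_altman([10.0, 12.0, 14.0, 16.0], [9.0, 12.0, 13.0, 16.0])
```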

Quantification of GD on phantom's data
Mean volume of each phantom bottle was 1196.77 ± 77.34 ml measured in CT. Figures 3 and 4 show the effect of GD on volume quantification for different MRI settings. For the reference MRI position with the phantom centered at the isocenter, volume differences were 6.91 ± 0.48 and 6.99 ± 0.22% with 2D and 3D Gradwarp correction, respectively. Electronic displacement of FOV showed volume differences of 6.93 ± 0.57 and 7.81 ± 0.64% with 2D and 3D Gradwarp, respectively. Manual displacement of the phantom showed volume differences of 6.97 ± 8.44 and 6.58 ± 1.42% with 2D and 3D Gradwarp, respectively. Table repositioning showed volume differences of 6.53 ± 0.57 and 6.94 ± 0.25% with 2D and 3D Gradwarp, respectively. Parallel imaging showed a volume difference of 1.80 ± 0.50% with 3D Gradwarp. However, bottles were not completely imaged with this acquisition due to the automatic FOV setting of the scanner, causing an underestimation of the true volume. Imaging using torso coil showed volume differences of 6.61 ± 7.10 and 5.56 ± 1.16% with 2D and 3D Gradwarp, respectively.
For electronic displacement, volumes of bottles on the right side of the phantom were consistently larger, independently of FOV positioning. The range of volume differences within the phantom was smaller with 3D Gradwarp than with 2D Gradwarp for all MRI settings. Figure 5 shows the effect of 2D and 3D Gradwarp. Bottles were more distorted the farther they were from the isocenter.

Intra-and intermethod agreement
Phantom volume measurements with 3D Slicer and AWS showed high intra- and intermethod agreement (ICC = 0.991 and ICC = 0.994, respectively). No significant differences between segmentation tools (Z = -0.177, p = 0.86) were found. A Wilcoxon signed-ranks test with Bonferroni-adjusted alpha levels of 0.025 indicated that the first segmentations were significantly higher in volume than the second segmentations of the same data using 3D Slicer (mean difference = 4.48 ± 7.78 ml, Z = -5.450, p < 0.001). Similarly, the first segmentations had significantly higher volumes than the second segmentations using AWS (mean difference = 6.36 ± 5.54 ml, Z = -7.590, p < 0.001). Bland-Altman plots showed that measurements with 3D Slicer and AWS differed very little (Fig. E3). Figure E4 represents the distribution of Dice scores with 2D and 3D Gradwarp. Mean Dice scores were 0.8593 ±
Fig. 5 Images illustrate the effect of 2D and 3D Gradwarp. a CT reference image, b MR image with 2D Gradwarp, c MR image with 3D Gradwarp. MR images were obtained with the phantom positioned 5 cm to the right of the scanner isocenter. Bending of bottles on the right side of the phantom (blue and green bottles) was seen when the bottles were moved farther from the scanner isocenter. With 3D Gradwarp, all bottles appear straight

Quantification of GD on patients' data
Mean end-inspiratory volume difference between 2D and 3D GW off-line scans was -0.91 ± 2.08%. Mean end-inspiratory volume difference between 2D and 3D GW scanner was 5.50 ± 9.62%. Mean end-inspiratory volume difference between 3D GW scanner and 3D GW off-line was 5.90 ± 9.71%. Based on phantom testing, volume difference between 2D and 3D Gradwarp using torso coil was -0.64 ± 5.59%. Therefore, mean volume difference between 3D GW scanner and 3D GW off-line was around 0.27 ± 11.93%.

Comparison of segmentation methods
A total of 176 segmentations were analyzed for accuracy, reproducibility, and time efficiency, of which 110 were end-inspiratory segmentations and 66 were end-expiratory segmentations. Figure 6 shows an example slice and corresponding segmentations with each tested method. Segmentation of mediastinal structures and peripheral lung portions was found to be the source of variation, leading to volume differences between software measurements. Semi-automated and fully automated methods showed similar segmentation errors, namely inclusion of nonlung tissue (e.g., mediastinum) and exclusion of lung tissue in regions with low signal-to-noise ratio (e.g., lung apices).

End-inspiratory segmentations
Semi-automated (ICC = 0.988-0.993) and fully automated segmentation results (ICC = 0.971-0.982) showed high agreement with MS (Tables 2 and 3). Results indicated that segmentations with GeoS, Pennati, and Ivanovska methods were similar to MS, with volume differences ranging from 0.59 to 1.37%. One subject was not segmented by Ivanovska's method due to motion artifacts. 3D Slicer showed a significant difference (p < 0.001) with volume differences up to 2.89%. Bland-Altman plots showed good agreement between semi- and fully automated methods and MS (Fig. E5).

End-expiratory segmentations
Both semi-automated methods showed high agreement with MS (ICC = 0.992-0.995) (Table 4). Their results were similar to MS, with mean differences of -1.27 and 1.81% for 3D Slicer and GeoS, respectively. Bland-Altman plots showed a slightly better performance for 3D Slicer than GeoS (Fig. E6).

Vital capacity
Spirometry data simultaneously obtained during the MRI scan were available for 8 out of 11 subjects, of which 4 showed large performance variability (Table 5). Association between VC MRI and VC SPIROMETER was similar but not significant for MS (r = 0.444, p = 0.271), GeoS (r = 0.440, p = 0.275), and 3D Slicer (r = 0.423, p = 0.297).
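The lack of significance with only eight subjects can be checked with the standard t statistic for a Pearson correlation, t = r·sqrt((n − 2)/(1 − r²)) with n − 2 degrees of freedom. The sketch below is illustrative only: the function names are hypothetical, and the critical value 2.447 (two-sided, α = 0.05, 6 degrees of freedom) is taken from standard t tables rather than from this study.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient of two equally long samples."""
    if len(x) != len(y) or len(x) < 3:
        raise ValueError("need two equally long samples with at least 3 points")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    covariance = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sample")
    return covariance / math.sqrt(var_x * var_y)


def t_statistic(r, n):
    """t statistic for testing r against zero, with n - 2 degrees of freedom."""
    if n < 3 or abs(r) >= 1:
        raise ValueError("need n >= 3 and |r| < 1")
    return r * math.sqrt((n - 2) / (1.0 - r * r))


CRITICAL_T_DF6 = 2.447  # two-sided, alpha = 0.05, 6 degrees of freedom (t table)

# r reported for MS in the text, with the 8 subjects who had spirometry data.
t_ms = t_statistic(0.444, 8)
```

With r = 0.444 and n = 8, the statistic is about 1.21, well below the critical value, which is consistent with the reported p = 0.271.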

Segmentation time
Segmentation time for each method is shown in Table 2. The time displayed excludes the time needed for file conversion, uploading, and saving steps. Manual segmentation time for end-expiratory images was shorter than for end-inspiratory images because of the shorter scan range due to lower volumes in expiration.
Both observers found MS a laborious task, aside from the considerable amount of time required for segmentation. Among the semi-automated methods, GeoS performed faster and was less laborious than 3D Slicer. Fully automated methods took approximately 2 to 3 min and required minimal user interaction.

Discussion
We found that mean volume differences from MRI relative to CT due to GD for an object centered at the isocenter of the scanner were 5.56 ± 1.16 and 6.99 ± 0.22% with torso and body coil, respectively. Moreover, we found high Dice overlap scores for images with 2D Gradwarp and even higher scores with 3D Gradwarp. We also compared five segmentation methods of differing complexity and degrees of user interaction. We found that MRI systematically overestimates volume measurements compared to CT due to GD, with varying volume differences according to MRI setting and patient positioning. The range of volume differences within the phantom was always smaller with 3D than with 2D Gradwarp, but the mean difference was sometimes higher with 3D Gradwarp. This means that 3D Gradwarp normalized the intraobject volume differences but tended to increase volume overestimation. A discrepancy was found in the expected effect of GD on volume measurements between electronic and manual displacement settings. Previous studies have shown that the greater the distance from the isocenter, the greater the GD [7,10]. While this was true for the manual displacement setting, it was not for the electronic displacement. Electronic displacement consistently generated a larger volume for bottles on the right side of the phantom. This may be due to asymmetrical inhomogeneity of the magnetic field.
The MRI system changed the FOV from 500 to 600 mm when 3D Gradwarp was applied. When the smaller FOV was used with 2D Gradwarp, some bottles were not completely imaged, causing an underestimation of the true volume. This problem can explain the lower volume differences found with 2D than with 3D Gradwarp phantom data for the electronic displacement and table repositioning settings.
Finally, end-inspiratory segmentations with GeoS, Pennati, and Ivanovska methods and end-expiratory segmentations with 3D Slicer and GeoS showed similar volume measurements to manual segmentation. Results from the present study suggest that fully automated methods can be used for end-inspiratory lung volume segmentations of large cohort studies, reducing segmentation time and effort without sacrificing accuracy. To date, no fully automated algorithms for end-expiratory images are available. At present, GeoS seems to be the fastest and most accurate semi-automated segmentation method for lung volume segmentation of end-expiratory images.
We acknowledge some limitations to this study. Firstly, the number of subjects was small. However, each subject had four lung MRI acquisitions, so 44 acquisitions were obtained to test five software methods. Secondly, missing lung function data for several subjects hampered the analysis. VC MRI correlated positively, although not significantly, with VC SPIROMETER, and analysis of a larger subset is needed to confirm these results. This will eventually be achieved when the entire dataset of Generation R is segmented. Thirdly, only one MRI system was used to acquire the MR images. While this ensured homogeneity in the resulting MR images, it does not reflect the wide range of MRI systems and imaging sequences used in practice. However, our approach can be applied to other vendors and MRI sequences.

Conclusion
MRI systematically overestimates volume compared to CT due to GD. 3D Gradwarp images yield shapes that are similar to reference CT images, but can also lead to larger volume overestimation. The effect of GD on volume measurements for images acquired with specific MR settings or in specific patient positions can be quantified and potentially be predicted and corrected.
Semi-automated and fully automated segmentation methods allow accurate, reproducible, and fast lung volume quantification using MRI. We conclude that chest MRI is a valid radiation-free alternative to CT to assess lung volume in large cohort studies.

the Municipal Health Service, Rotterdam area, and the Stichting Trombosedienst and Artsenlaboratorium Rijnmond (Star-MDC), Rotterdam. We gratefully acknowledge the contribution of children and their parents, general practitioners, hospitals, midwives, and pharmacies in Rotterdam. We thank Mika Vogel (MRI site scientist at Erasmus MC of GEHC) for his technical support.
Funding The Generation R Study is made possible by financial support from the Erasmus Medical Center, Rotterdam, the Erasmus University Rotterdam, and the Netherlands Organization for Health Research and Development. Dr. Liesbeth Duijts received funding from the European Union's Horizon 2020 co-funded program ERA-Net on Biomarkers for Nutrition and Health (ERA HDHL) (ALPHABET project (no. 696295; 2017), ZonMW The Netherlands (no. 529051014; 2017)), and the European Union's Horizon 2020 research and innovation program (LIFECYCLE project, grant agreement no. 733206; 2016). The study sponsors had no role in the study design, data analysis, interpretation of data, writing, or submission of this report.

Compliance with ethical standards
Guarantor The scientific guarantor of this publication is Pierluigi Ciet.

Conflict of interest
The authors declare that they have no conflict of interest.
Statistics and biometry No complex statistical methods were necessary for this paper.
Informed consent Written informed consent was obtained from all subjects in this study.