Accuracy, repeatability, and reproducibility of T1 and T2 relaxation times measurement by 3D magnetic resonance fingerprinting with different dictionary resolutions

Objectives To assess the accuracy, repeatability, and reproducibility of T1 and T2 relaxation time measurements by three-dimensional magnetic resonance fingerprinting (3D MRF) using various dictionary resolutions.
Methods The ISMRM/NIST phantom was scanned daily for 10 days in two 3 T MR scanners using a 3D MRF sequence reconstructed using four dictionaries with varying step sizes and one dictionary with wider ranges. Thirty-nine healthy volunteers were enrolled: 20 subjects underwent whole-brain MRF scans in both scanners and the rest in one scanner. ROI/VOI analyses were performed on phantom and brain MRF maps. Accuracy, repeatability, and reproducibility metrics were calculated.
Results In the phantom study, all dictionaries showed high T1 linearity to the reference values (R² > 0.99), repeatability (CV < 3%), and reproducibility (CV < 3%) with lower linearity (R² > 0.98), repeatability (CV < 6%), and reproducibility (CV ≤ 4%) for T2 measurement. The volunteer study demonstrated high T1 reproducibility of within-subject CV (wCV) < 4% by all dictionaries with the same ranges, both in the brain parenchyma and CSF. Yet, reproducibility was moderate for T2 measurement (wCV < 8%). In CSF measurement, dictionaries with a smaller range showed a seemingly better reproducibility (T1, wCV 3%; T2, wCV 8%) than the much wider range dictionary (T1, wCV 5%; T2, wCV 13%). Truncated CSF relaxometry values were evident in smaller range dictionaries.
Conclusions The accuracy, repeatability, and reproducibility of 3D MRF across various dictionary resolutions were high for T1 and moderate for T2 measurements. A lower-resolution dictionary with a well-defined range may be adequate, thus significantly reducing the computational load.
Key Points
• A lower-resolution dictionary with a well-defined range may be sufficient for 3D MRF reconstruction.
• CSF relaxation times might be underestimated due to truncation by the upper dictionary range.
• Dictionary with a higher upper range might be advisable, especially for CSF evaluation and elderly subjects whose perivascular spaces are more prominent.
Supplementary Information The online version contains supplementary material available at 10.1007/s00330-022-09244-x.

MRF applies pseudorandomized acquisition to create temporal incoherence, so that different tissues show unique signal evolutions, the so-called fingerprints. These fingerprints are then matched to a predefined dictionary containing sets of predicted signal evolutions [7]. The dictionary thus becomes a crucial component of MRF reconstruction. A dictionary is determined by the number of anticipated tissues and combinations of system-related parameters. In general, dictionaries include T1 and T2 values [8]. Each parameter has a range and a step size, which together determine the dictionary resolution.
Three-dimensional (3D) MRF provides higher signal-to-noise ratio efficiency and spatial resolution compared to 2D MRF [9,10]. Combined with a parallel imaging technique, such as generalized autocalibrating partial parallel acquisition (GRAPPA), 3D MRF can provide whole-brain imaging within a feasible time [11]. Although various MRF reproducibility studies have been performed [12][13][14][15][16], the effect of dictionaries with differing ranges and step sizes on 3D MRF accuracy, repeatability, and reproducibility has not yet been thoroughly investigated. This study investigated 3D MRF reconstructed using various dictionary resolutions to evaluate measurement accuracy, repeatability, and reproducibility in the phantom and human brain.

Materials and methods
This prospective study was performed in accordance with the Declaration of Helsinki and was approved by the local Ethics Committee. All volunteers provided written informed consent prior to enrollment.
Four dictionaries with equal ranges but different step sizes were used for pattern matching. The highest resolution dictionary, denoted as HRD, had 300 T1 entries ( . Additionally, we also performed an experiment with a dictionary that had a much wider range and finer step size, denoted as the very high-resolution dictionary (VHRD), which had 1150 entries for both T1 and T2 (1:1:100, 102:2:1000, 1010:10:7000).
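The VHRD entry count can be checked directly from the three segments quoted above. The following is a minimal sketch, assuming the notation is an inclusive start:step:end range in milliseconds; the helper name `inclusive_range` is ours and not part of the study's software. The pair count is only an upper bound under the assumption that every T1/T2 combination is kept, which the text does not state.

```python
import numpy as np

def inclusive_range(start, step, stop):
    """Return start, start+step, ..., stop (inclusive), as in start:step:stop notation."""
    return np.arange(start, stop + step, step)

# VHRD segments as quoted in the text (assumed to be in ms).
vhrd_segments = [(1, 1, 100), (102, 2, 1000), (1010, 10, 7000)]
vhrd_values = np.concatenate([inclusive_range(*seg) for seg in vhrd_segments])

n_entries = vhrd_values.size  # 100 + 450 + 600 = 1150
# Upper bound only: assumes a full T1 x T2 grid with no combinations excluded.
n_pairs_upper_bound = n_entries * n_entries

print(n_entries, vhrd_values[0], vhrd_values[-1])
```

The same helper can be reused to count entries of any other dictionary once its segments are known.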
Dictionary generation was based on the extended phase graph (EPG) model, as described by Weigel [17]. To estimate the quantitative maps, 3D volumes of multiple time points were normalized and pattern-matched voxelwise to the corresponding dictionary using the maximum inner product method. MRF reconstruction was performed on a workstation with the following specifications: CPU, Intel Core i7-7800X 3.50 GHz; memory, 32 GB; GPU, Nvidia GeForce RTX 2080 Ti 11 GB.
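Maximum inner product matching can be illustrated with a short sketch. This is not the reconstruction code used in the study: the function name `match_fingerprints` is hypothetical, and the toy dictionary uses a simple analytic signal as a stand-in for EPG-simulated fingerprints. Each voxel signal and each dictionary atom is normalized to unit length, and the atom with the largest absolute inner product supplies the voxel's T1 and T2.

```python
import numpy as np

def match_fingerprints(signals, dictionary, t1_t2_table):
    """Match each signal to the dictionary atom with the largest normalized inner product.

    signals:     (n_voxels, n_timepoints) measured signal evolutions
    dictionary:  (n_entries, n_timepoints) simulated signal evolutions
    t1_t2_table: (n_entries, 2) T1 and T2 (ms) belonging to each atom
    Assumes no signal or atom has zero norm.
    """
    d_norm = dictionary / np.linalg.norm(dictionary, axis=1, keepdims=True)
    s_norm = signals / np.linalg.norm(signals, axis=1, keepdims=True)
    scores = np.abs(s_norm @ d_norm.conj().T)  # (n_voxels, n_entries)
    best = np.argmax(scores, axis=1)
    return t1_t2_table[best], best

# Toy dictionary: an analytic stand-in, NOT an EPG simulation.
t = np.linspace(5.0, 500.0, 100)
table = np.array([(t1, t2) for t1 in (500.0, 1000.0, 2000.0) for t2 in (50.0, 100.0)])
toy_dict = np.array([(1 - np.exp(-t / t1)) * np.exp(-t / t2) for t1, t2 in table])

# Two "measured" voxels: scaled copies of atoms 4 and 1.
measured = 3.7 * toy_dict[[4, 1]]
matched, idx = match_fingerprints(measured, toy_dict, table)
print(idx, matched)
```

Because both sides are normalized, the overall signal scale does not affect the match. The cost grows with the number of dictionary entries, which is why dictionary size drives matching time.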

Phantom study
The International Society of Magnetic Resonance in Medicine/National Institute of Standards and Technology (ISMRM/NIST) phantom was scanned daily for 10 days in both scanners. The phantom contains T1 and T2 arrays of 14 spheres each, with specific values (T1, 23-1838 ms; T2, 5-646 ms). We chose to evaluate the measured values of phantom arrays number 1 to 7 (T1, 259-1838 ms; T2, 63-646 ms), as those arrays represent the physiologic ranges of T1 and T2 values in the human brain. Reference T1 and T2 values of the phantom arrays were obtained from the phantom manual provided by the manufacturer. We also evaluated the MRF measurement consistency in the boundary slices outside the phantom spheres. The phantom temperature was measured before and after acquisition, and the average of the two was recorded.

Volunteer study
Between October 2019 and April 2020, 39 healthy volunteers without any known neurological disease were enrolled. Twenty volunteers underwent MRF scans in both scanners, and the remaining 19 underwent scans in one of the two scanners. T1-weighted images were also obtained using a 3D magnetization-prepared rapid acquisition of gradient echo (3D-MPRAGE) sequence with the following parameters: TR, 1900 ms; TE, 2.6 ms; inversion time (TI), 900 ms; FA, 9°; FOV, 230 × 230 mm; matrix size, 256 × 256; slice thickness, 0.9 mm; and parallel imaging factor, 2.

Post-imaging processing
A circular region of interest (ROI) 10 mm in diameter was drawn in the center slice of each T1 and T2 phantom array using ImageJ (https://imagej.nih.gov/ij/). As the T1 arrays are located near the apex of the phantom and the T2 arrays near the midline, the axial cross-sectional area is larger for the T2 arrays than for the T1 arrays. Therefore, we opted to mask the phantom area to the same size for the T1 and T2 arrays. In addition, two sets of four circular ROIs with the same diameter were drawn, one set each in the upper and lower boundary slices outside the phantom spheres. Mean relaxation times were measured using the built-in function of ImageJ.

Statistical analysis
Statistical analyses were performed using commercially available software (MedCalc version 20.0; MedCalc Software Ltd.). Accuracy was assessed by linear regression and Bland-Altman (BA) relative difference plots between measured relaxation times (averaged over the 10 days of measurements) and phantom reference values. The coefficient of variation (CV) over 10 days of scanning was calculated to evaluate repeatability. Reproducibility was determined from BA plots, intraclass correlation coefficients (ICCs), and within-subject CVs (wCVs) between MRF maps from different scanners in the phantom and healthy volunteers. Mean T1 and T2 values of brain VOIs were compared between dictionaries using ANOVA.
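The repeatability and reproducibility metrics named above can be sketched as follows. This is an illustration only, not the MedCalc implementation used in the study. The function names are ours, the wCV uses the common root-mean-square definition for paired measurements (an assumption, since the text does not give the formula), and all numbers in the usage lines are invented.

```python
import numpy as np

def cv_percent(values):
    """Coefficient of variation (%) of repeated measurements, using the sample SD."""
    values = np.asarray(values, dtype=float)
    return 100.0 * values.std(ddof=1) / values.mean()

def bland_altman_relative(a, b):
    """Relative-difference Bland-Altman bias (%) and 95% limits of agreement."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    rel_diff = 100.0 * (a - b) / ((a + b) / 2.0)
    bias = rel_diff.mean()
    half_width = 1.96 * rel_diff.std(ddof=1)
    return bias, bias - half_width, bias + half_width

def wcv_percent(a, b):
    """Within-subject CV (%) for paired measurements (root-mean-square definition; assumed)."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    within_var = (a - b) ** 2 / 2.0  # variance of each pair
    pair_mean = (a + b) / 2.0
    return 100.0 * np.sqrt(np.mean(within_var / pair_mean ** 2))

# Invented example values (ms), for illustration only.
daily_t1 = [1005, 998, 1010, 1002, 995, 1008, 1001, 997, 1004, 1000]
scanner_a = [820, 1310, 905, 1250]
scanner_b = [835, 1295, 915, 1270]

print(cv_percent(daily_t1))
print(bland_altman_relative(scanner_a, scanner_b))
print(wcv_percent(scanner_a, scanner_b))
```

A relative (percentage) difference is used in the Bland-Altman function so that regions with very different relaxation times can be pooled on one scale.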
The dictionary with a much wider range and finer step sizes (VHRD) demonstrated similar accuracy and interscanner reproducibility, yet a seemingly higher 10-day CV for T2 measurements (5.3%), as shown in Supplementary Table 1. MRF repeatability and reproducibility in the boundary-slice ROIs were lower than those of the phantom arrays located in a more central slice, with 10-day CVs of 3.3-4.1% (T1) and 4.9-8.1% (T2) and interscanner CVs of 4.3-4.9% (T1) and 8.0-10.7% (T2). These findings are described in more detail in the Supplementary Appendix.
Fig. 1 Example of T1 map overlaid with volumes of interest selected from FreeSurfer segmentation results
Fig. 2 Representative T1 and T2 maps of the ISMRM-NIST phantom, reconstructed using the high-resolution dictionary (HRD; first column from left), moderately low-resolution dictionary (LRD-1; second column), low-resolution dictionary (LRD-2; third column) and very low-resolution dictionary (LRD-3; fourth column), scanned in two MR scanners. Window width was set at 2400 ms and window level at 1200 ms for T1 maps and at 800 ms and 400 ms for T2 maps. T1 and T2 maps of 4 dictionaries showed apparently similar results

Volunteer study
All 39 volunteers (19 men, 20 women; mean age, 26.2 ± 4.1 years) were included in the final analysis. MRF template matching for a whole-brain scan requires around 2.4 h, 32 min, 12.6 min, and 6.3 min for HRD, LRD-1, LRD-2, and LRD-3, respectively. Meanwhile, the much broader and finer dictionary, VHRD, requires almost 59 h for template matching. Aside from template matching, the whole reconstruction process for each dictionary requires an additional 3.1 h for raw data loading and processing. MRF T1 and T2 maps of one representative subject reconstructed using each dictionary are shown in Fig. 5. Average normalized MRF maps from all dictionaries were consistent across most of the brain parenchyma (Supplementary Figure 1). Mean T1 and T2 values of each VOI are shown in Table 1. There was no significant difference between measurements from the four dictionaries in any VOI (ANOVA, p > 0.05).
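The relative cost of template matching follows from simple arithmetic on the times quoted above. The sketch below uses only those approximate figures ("around" and "almost" in the text), so the ratios are indicative rather than exact, and the variable names are ours.

```python
# Approximate whole-brain template-matching times from the text, in minutes.
matching_minutes = {
    "VHRD": 59 * 60.0,   # "almost 59 h"
    "HRD": 2.4 * 60.0,   # "around 2.4 h"
    "LRD-1": 32.0,
    "LRD-2": 12.6,
    "LRD-3": 6.3,
}

# How many times faster (or slower, if < 1) each dictionary matches relative to HRD.
speedup_vs_hrd = {
    name: matching_minutes["HRD"] / minutes
    for name, minutes in matching_minutes.items()
}

for name, factor in speedup_vs_hrd.items():
    print(f"{name}: {factor:.1f}x relative to HRD")
```

These ratios cover template matching only and do not include the additional time for raw data loading and processing.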
Interscanner reproducibility metrics for T1 and T2 measurements were nearly identical across the four dictionaries, as shown in Table 2. In BA analyses of brain parenchyma VOIs, relative mean differences were around 1.3% with 11.1-11.2% limits of agreement (T1) and 0.3% with 19.4% limits of agreement (T2), as shown in Supplementary Figure 2a. ICCs were almost uniform at 0.953 (T1) and 0.954 (T2) for all dictionaries, except LRD-3 (T2, 0.953). The average within-subject coefficient of variation (wCV) was comparable across the four dictionaries, around 1.6% (T1) and 4.6% (T2) for white matter (WM) and 3.4% (T1) and 5.5% (T2) for grey matter (GM). No notable difference was found across dictionaries for cerebrospinal fluid (CSF) VOIs (Table 2). CSF T1 measurement had a relative mean difference of 1.1% with 9.7-9.8% limits of agreement, while T2 had a nearly 0% relative mean difference with 25.9-26.4% limits of agreement (Supplementary Figure 2b). ICCs were substantially lower than in the brain parenchyma VOIs but were similar across dictionaries, at 0.65 for T1 and 0.80 for T2. The average wCV was about 2.5% for T1 and 7.6-7.7% for T2.
However, we observed higher variability in the CSF VOI using the wider range dictionary, VHRD (Supplementary Table 2). The VHRD T1 relative mean difference was 2.0% with 17.5% limits of agreement, compared to 1.1% with 9.8% limits of agreement for HRD, and the T2 relative mean difference was 1.6% (limits of agreement, 48.8%) compared to an HRD mean bias of nearly 0% (limits of agreement, 26.4%). The seemingly better reproducibility of HRD than VHRD in the CSF VOI was also evident in the wCV and ICC analyses. HRD wCVs were 2.5% and 7.7% with ICCs of 0.650 and 0.797, while VHRD wCVs were 5.0% and 13.3% with ICCs of 0.511 and 0.525, for T1 and T2 measurements, respectively. The mean T1 value of CSF was markedly higher with VHRD (4273 ms) than with HRD (3727 ms), with maximum T1 values of 7000 ms and 4500 ms for VHRD and HRD, respectively. These maximum values correspond to the upper range of each dictionary. Notably, HRD truncated a certain range of T1 and T2 values in the CSF VOI (Supplementary Figure 3).

Discussion
This study investigated 3D MRF with four dictionary resolutions with the same ranges and evaluated three important quantitative metrics of imaging performance: accuracy, repeatability, and reproducibility. The phantom study showed that MRF has high accuracy for T1 measurement and modest accuracy for T2 measurement. MRF repeatability over 10 days of scanning was good. Reproducibility, evaluated through both the phantom and volunteer studies, was also good, with high interscanner reproducibility particularly for T1 estimation. Additionally, 3D MRF with a broader range dictionary was investigated.
The high accuracy of MRF T1 estimations with lower accuracy for T2 estimations was evident in this ISMRM/NIST phantom study, consistent with previously reported results [10,19], regardless of dictionary resolution. The supplementary information of the 2D MRF study by Ma et al [1] described that T1 and T2 measurement accuracies were not significantly affected by dictionary resolution. Our phantom results with 3D MRF supported the previous evaluation with 2D MRF. In terms of repeatability, comparable CVs were evident for all dictionaries in T1 and T2 measurements (T1, CV < 3%; T2, CV < 5%) compared to the 3D MRF study by Ma et al, which implemented B1 correction (T1, CV < 4%; T2, CV < 7%) [10]. Compared to T1 measurements, a lower interscanner reproducibility on T2 measurements was noted with all dictionaries. Such lower T2 reproducibility has been reported in the MRF literature using 2D-SSFP MRF (T1, CV 0.2% vs. T2, CV 0.7%) [13]. In the present phantom study, there were no marked differences in accuracy, repeatability, or reproducibility between dictionaries with differing resolutions and ranges.
Table 2 Interscanner reproducibility of 3D MRF in the human brain, evaluated using Bland-Altman plots, ICCs, and wCV across dictionaries with the same ranges (HRD, LRD-1, LRD-2, LRD-3). Despite the number of entries of each dictionary being substantially different, all dictionaries demonstrated comparable metrics, both in the brain parenchyma and CSF VOIs
In the human study, all dictionaries with the same ranges obtained comparable T1 and T2 values for all brain VOIs. Nevertheless, higher variability was observed in CSF relaxation time measurements than in brain parenchyma, consistent with prior reports [13,20]. The lower reproducibility of CSF relaxometry might be attributed to the lower SNR of CSF due to the relatively high flip angle of the MRF pulse [13], CSF flow effects [20], and physiological inhomogeneity [21].
In addition, significant differences in CSF relaxometry measurements were seen between dictionaries with different ranges. We considered the upper dictionary limit a major contributing factor. The four dictionaries with the same ranges potentially underestimate CSF relaxation times by truncating the upper limit at 4500 ms, as VHRD measured CSF T1 values exceeding 4500 ms in nearly half of the subjects. Previous studies using various T1 mapping techniques have reported CSF T1 values similar to the VHRD measurements (> 4000 ms) [22][23][24]. Consistent with this, FISP-MRF literature with lower dictionary upper ranges (3000-4500 ms) also described lower T1 values for CSF [14,25].
More importantly, to various extents, CSF relaxometry will also affect relaxometry measurements of brain parenchyma due to the partial volume of subarachnoid CSF, perivascular spaces, and subvoxel water content. Our data showed that in most brain parenchymal VOIs, measured maximum values differed significantly, with each close to the upper range of the respective dictionary. In elderly subjects, in whom perivascular spaces are more abundant than in our relatively young study participants, this difference could also affect mean T1 and T2 values. Thus, a broader range dictionary with a higher upper limit of T1 (> 4500 ms) might be advisable, especially for elderly subjects.
The seemingly better reproducibility shown by four dictionaries with the same ranges compared to VHRD in CSF T1 value measurements may be attributable to the truncation of the dictionary's upper limit, yielding less variability in the measured relaxation times. Our MRF reproducibility might also have been affected by B1 variation. Nonetheless, our MRF measurements for grey and white matter were comparable to values reported in the literature using spin-echo techniques [24,25], with interscanner CVs kept under 6%. A 3D MRF study without B1 correction reported interscanner CVs under 10% for T1 and T2 in solid brain compartments [26].
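The proposed mechanism, that truncation at the dictionary's upper limit lowers apparent variability, can be illustrated with a toy calculation. The paired values below are invented and are not study data. Clipping at 4500 ms is an idealization of matching against a dictionary whose T1 range ends at 4500 ms, and the wCV uses the root-mean-square definition for paired measurements, which is an assumption.

```python
import numpy as np

def wcv_percent(a, b):
    """Within-subject CV (%) for paired measurements (root-mean-square definition; assumed)."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    within_var = (a - b) ** 2 / 2.0
    pair_mean = (a + b) / 2.0
    return 100.0 * np.sqrt(np.mean(within_var / pair_mean ** 2))

UPPER_T1_MS = 4500.0  # upper T1 limit of the four equal-range dictionaries (from the text)

# Invented paired CSF T1 values (ms) from two scanners, for illustration only.
scanner_a = np.array([4200.0, 4800.0, 5200.0, 4400.0, 5600.0])
scanner_b = np.array([4400.0, 5300.0, 4700.0, 4300.0, 5000.0])

# Idealized truncation: values above the dictionary limit collapse onto the limit.
clipped_a = np.minimum(scanner_a, UPPER_T1_MS)
clipped_b = np.minimum(scanner_b, UPPER_T1_MS)

wcv_raw = wcv_percent(scanner_a, scanner_b)
wcv_clipped = wcv_percent(clipped_a, clipped_b)
mean_raw = np.mean([scanner_a, scanner_b])
mean_clipped = np.mean([clipped_a, clipped_b])

print(f"wCV without truncation: {wcv_raw:.1f}%, with truncation: {wcv_clipped:.1f}%")
print(f"mean without truncation: {mean_raw:.0f} ms, with truncation: {mean_clipped:.0f} ms")
```

In this toy case the truncated values show both a lower wCV and a lower mean. That is the same direction as the HRD versus VHRD differences reported for CSF, but it does not establish truncation as the cause.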
Despite the improved estimation performance, the computational expense of the dictionary-based method remains a significant barrier to MRF application in clinical practice, where parameter maps should be generated quickly. The dictionary matching process for a large dictionary (e.g., HRD and VHRD) might take more than 1 h on a typical workstation with a GPU. For these reasons, the number of entries should be kept under control. Nevertheless, the dictionary must be carefully defined so that it fully covers the physiologic range of the target tissue and is not so sparse that it produces unacceptably large biases. The lowest resolution dictionary in our study required around 6 min for the template matching process while providing comparable accuracy, repeatability, and reproducibility.

Limitations
Our study had several limitations. First, we did not perform any B1 correction measures, as this option was unavailable during subject enrollment. Nonetheless, the sequence became available later, and we conducted an additional phantom experiment with B1 correction. However, unlike prior studies [9,10,27], we experienced banding artifacts in the reconstructed MRF maps after B1 correction due to inaccurate B1 maps, which significantly affected the measured value. Also, the implemented 3D MRF sequence does not include any measures to correct flip angle error due to imperfect slab profile. Correspondingly, lower T1 and T2 measurement consistencies were evident in our phantom experiment using boundary slices, as described in the Supplementary Appendix.
Underestimation of T2 values was notable in all dictionaries, particularly in the two highest T2 phantom arrays. Such underestimation was consistently reported in previous studies utilizing the FISP-MRF sequence [10,12]. As this underestimation persisted even after the B1 correction [12], other causes should be considered. Moreover, this effect is not apparent in different sequences, such as MRF based on echo-planar imaging (MRF-EPI) [28]. Kobayashi et al found that the T2 underestimation resulted from the diffusion weighting caused by the spoiler gradient used in the FISP-MRF [29]. Incorporating the diffusion effect of the spoiler gradient and ADC maps into the dictionary might be one potential solution. Furthermore, in the phantom and human study, interscanner reproducibility of T2 estimation was notably lower than T1. Such findings were consistently reported in prior MRF literature [12][13][14] and might pose a challenging task for future MRF optimization.
Second, we did not perform a scan-rescan of each subject in the same scanner. Intra-scanner repeatability in volunteers thus could not be assessed. Also, as scans in the two MR units were acquired on different days, physiological variations may have confounded interscanner reproducibility. Physiological differences can affect brain morphometry even in same-day scans [30].
Last, the age range of volunteers in our study was relatively narrow and elderly participants were not included. Further studies are needed to investigate the clinical value of a broader MRF dictionary range, particularly in scans of elderly individuals, in whom perivascular space and subvoxel water content are more prominent.

Conclusion
In conclusion, our study demonstrated that 3D MRF offered good accuracy, repeatability, and reproducibility, particularly for T1 value estimation, with comparable performance across different dictionary resolutions. A dictionary with a lower resolution but a well-defined range may be adequate, resulting in significant reductions in computational load.