How reliable are ADC measurements? A phantom and clinical study of cervical lymph nodes

Objective To assess the reliability of ADC measurements in vitro and in cervical lymph nodes of healthy volunteers. Methods We used a GE 1.5 T MRI scanner and a first ice-water phantom according to recommendations released by the Quantitative Imaging Biomarker Alliance (QIBA) for assessing ADC against reference values. We analysed the target size effect by using a second phantom made of six inserted spheres with diameters ranging from 10 to 37 mm. Thirteen healthy volunteers were also scanned to assess the inter- and intra-observer reproducibility of volumetric ADC measurements of cervical lymph nodes. Results On the ice-water phantom, the error in ADC measurements was less than 4.3 %. The spatial bias due to the non-linearity of gradient fields was found to be 24 % at 8 cm from the isocentre. ADC measure reliability decreased when addressing small targets due to partial volume effects (up to 12.8 %). The mean ADC value of cervical lymph nodes was 0.87.10-3 ± 0.12.10-3 mm2/s with a good intra-observer reliability. Inter-observer reproducibility featured a bias of -5.5 % due to segmentation issues. Conclusion ADC is a potentially important imaging biomarker in oncology; however, variability issues preclude its broader adoption. Reliable use of ADC requires technical advances and systematic quality control. Key Points • ADC is a promising quantitative imaging biomarker. • ADC has a fair inter-reader variability and good intra-reader variability. • Partial volume effect, post-processing software and non-linearity of scanners are limiting factors. • No threshold values for detecting cervical lymph node malignancy can be drawn. Electronic supplementary material The online version of this article (10.1007/s00330-017-5265-2) contains supplementary material, which is available to authorized users.


Introduction
Recent advances in medical imaging technology and drug therapeutics have accelerated the emergence of new quantitative imaging biomarkers (QIB) [1,2]. The multiplication of these QIBs is unfortunately not always accompanied by Electronic supplementary material The online version of this article (https://doi.org/10.1007/s00330-017-5265-2) contains supplementary material, which is available to authorized users. stringent validations establishing that QIBs are well designed to characterize a disease and its changes with therapy. This lack of validation creates a situation where QIBs are routinely used but with limited knowledge of their performances, precluding a larger adoption in clinical trials.
Apparent diffusion coefficient (ADC) can quantify the level of free water diffusion restricted by an increase in tissue cellularity. Applications of ADC in cancer imaging has motivated intensive research and ADC is now one of the main QIBs derived from diffusion MRI.
Several studies have documented the incremental value of ADC assessment as a complement or substitute to standard sequences for the detection of malignant tumours [3], the degree of malignancy [4,5] or to evaluate response to treatment [6][7][8].
Previous literature comprises heterogeneous studies protocols and results [14]. Several sequential unitary processes are necessary to output an ADC assessment, the lack of reliability of any of these unitary processes is likely to degrade the final ADC assessment. It is therefore particularly relevant to study if ADC qualifies as a quantitative biomarker.
Over the last decade, a multidisciplinary community has organized retrospective investigations of QIBs starting by documenting methodologies [2]. In 2007, the Radiological Society of North America (RSNA) launched QIBA (Quantitative Imaging Biomarker Alliance [15]), a specialized working group aiming at improving the value and usefulness of QIBs in reducing variability across devices, patients, and practices.
One of QIBA aims consists in releasing 'Profiles', which are documents standardizing imaging protocols to obtain optimal, reliable and reproducible biomarker measures according to the current state of the art. The QIBA diffusion imaging profile is still a work in progress [16].
QIBA also proposes a standardized protocol for quality control in diffusion imaging, using a diffusion phantom [17,18] consisting of a volume of 0°C stabilized water as the reference value for ADC assessment [19,20].
The main objective of this study was to evaluate the variability of ADC measurements in vitro on a phantom and in vivo on cervical lymph nodes. The secondary objective was to understand and quantify ADC measurement errors, in view of correcting them in future studies.

Methods
We first tested QIBA metrics for quality control (QC) of ADC image quality, and then performed a reliability analysis of ADC measurements. Finally we measured ADC values of cervical lymph nodes in healthy volunteers.
This prospective study was conducted at the Centre Antoine Lacassagne, cancer centre in Nice, France, between March and November 2016. We used a GE MRI scanner 1-5T MR450W and ADW Volume Share 5 4.6 software to process images (GE Healthcare).

Quality control test
We used a DIN 6858-1 PET-CT phantom (PTW) consisting of a cylindrical Plexiglas body filled with a mixture of ice and water. Three smaller cylinders were inserted into the body, one of which was filled with water at 0°C (Fig. 1, left side).
Homogeneity of temperature inside the cylinder was thermometer-controlled according to the process defined into the QIBA profile to achieve thermal equilibrium (>1 h) over the entire MRI exam period. For each b value, four successive acquisitions spaced in time from more than 12 min were performed, allowing retrospective checks.
The diffusion protocol was 3three directions, DW SS-EPI with b=0, 100, 600, 800 s/mm 2 , TR=9,451 ms, TE=80 ms, Number of average = 2, FOV 320*320 mm, contiguous slice thickness of 4 mm, encoding frequency axis R/L. Four successive acquisitions were made for each b value, the phantom symmetry axis was laser-centred to the magnetic field positioning the 0°C water cylinder at the center of the scanner. Acquisitions of the phantom were performed horizontally (x-axis) and vertically (y-axis). We measured circular regions of interest (ROIs) of 2.5 cm diameter and composed of 123 voxels (Fig. 2). Mean ADC and standard deviation (SD) were computed.
According to the equations in Table 1, we computed the measurement repeatability (R), estimated by the coefficient of variation (CV R ) and the repeatability coefficient (RC R ), the accuracy (ADC Bias estimate), ADC noise estimate and bvalue dependency.
The signal-to-noise ratio (SNR) was computed using formula F (shown in Table 1) and involved computing the 'Temporal Noise Image' from the diffusion mapping at b = 0, with a 2-cm circular ROI.
Results were compared to QIBA 's references values [16].
In addition, we analysed the planar spatial correlation of ADC measures in shifting ROIs along the x and y axis. The ADC reference value was measured at the image center using formula C (see Table 1). We used circular ROIs of 2.2-cm diameter and 2-cm shifts from the centre either to the right (x-axis) or to the bottom (y-axis) of the image.
We simulated clinical conditions in using the cervical level of the routine whole-body MRI, i.e. axial DW SS EPI with b=50 and b=1,000 s/mm 2 , TR=10,384 ms, TE set to minimum (around 70 ms for all scans). Number of averages=2, parallel imaging factor=2, FOV=400*400 mm, contiguous 5-mm slice thickness, encoding frequency axis R/L. The phantom was laser centred, equidistant from all spheres. Four acquisitions were made at 1-day intervals. All values were averaged over 4 days.
ADC measures were obtained from spherical volumes of interest (VOIs) centred on spheres (Fig. 3).
The relative ADC error was computed for each sphere size, considering that the reference ADC value was from the 37mm sphere. We analysed the correlation between VOI size and precision of measurements in computing the CV R . Additional analysis documented the measurement error, first in measuring bias, second in computing the CV R through several concentric VOIs of decreasing size in the largest sphere, according to Table 1 (Formula A). Then partial volume effect was quantified by calculating the relative error within a VOI with a diameter equal to 80 % the diameter of a sphere compared to a VOI of identical size within the largest

In vivo study
Informed consent was obtained from 13 healthy volunteers. Exclusion criteria were chronic disease, history or ongoing symptoms of infection like fever, cough, rhinorrhoea, dysphagia and odynophagia, history of cervical surgery, claustrophobia and all usual contraindications for MRI. Demographic status and smoking habits were recorded for the 13 volunteers. Volunteers were scanned using the same machine as the phantom study. The acquisition was performed with a neck phased-array coil and the volunteer was instructed to breath normally.
Technical settings of diffusion sequence for volunteers were identical to those of the SPHERE phantom.
Two readers assessed ADC values of lymph nodes: a senior radiologist with more than 6 years of experience in cancer imaging and a junior radiologist.
Lymph node volumes were manually segmented on the b1000 scan, and the graphic was exported to the ADC map ( Fig. 4). At least four lymph nodes were selected per volunteer, including the largest. VOIs were segmented in delineating hyper-intense diffusion areas on b1000 scans while excluding lymph nodes hilum. Each node was segmented twice by each observer using the same acquisition with an interval of 7-60 days (mean 41 days) [21]. Mean and SD ADC values were recorded.
Inter-and intra-observer agreements were calculated according to the Bland Altman method using R CRAN software. Bias and limit of agreement (LoA) were computed. Inter-and intra-observer differences in segmenting lymph node volumes and ADC values were analysed using the sum of Wilcoxon rank for paired values test. A p-value < 0.05 was considered significant.

Results
Quality control test Table 2 shows QIBA's recommended limit values for repeatability, accuracy, precision and b-value dependency.
We found that SNR computed from diffusion scans at b = 0 was 17 1, lower than the recommended limit value (50 1).
Variations in ADC measures relative to spatial positions are summarized in Fig. 5.
We also found an increasing bias when shifting measurements from the scan isocenter (Table 3). The maximum 10 % error threshold recommended by QIBA was exceeded for ADC values measured at least 6 cm distant from the isocentre ( Table 3).

SPHERE phantom study
Firstly, our analysis showed that when VOIs are set within spheres of decreasing size, relative error and measurements variability of ADC measurements increased (Table 4). Secondly, we found no significant mean ADC difference for VOIs of decreasing sizes set within the largest sphere. In thoses case, we found less than 2 % error between the largest and smallest VOIs.
Correlation and variability analysis of ADC measurements with VOI size seemed to indicate a significant partial volume effect. Partial volume effect was visually confirmed on images. Overall, 54 cervical lymph nodes were selected for analysis mainly on carotid-jugular sites, with a mean volume of 1 cm 3 (Appendix 1). The mean value of measured ADC was 0.87 × 10 -3 mm 2 /s (0.66-1.28 .10 -3 mm 2 /s, SD was 0.12 .10 -3 mm 2 /s). We found a significant difference between the average ADC values measured by readers 1 and 2 (0.84.10 -3 and 0.90.10 -3 mm 2 /s, respectively, p <0.0001).
We found a significant difference in average segmented volumes between readers 1 and 2 (respectively 1.18 +/-0.94 cm 3 and 1.92 +/-1.23 cm 3 , p <0.0001). There was a low correlation between measurement differences in terms of average ADC and volume segmentation (R 2 = 0.37) by the two observers.
Using the Beaumont et al. method [22] and based on our intra-observer reproducibility parameters, we can estimate that on longitudinal studies under strict reproducible conditions (same patient, same reader), a meaningful relative change of ADC value should be outside the range [-13 %; + 15% ].

Discussion
Our QC results showed good compliance with QIBA metrics, except for ADC bias estimate, which was slightly above the limit, and with a variability of about 9 %. Results were independent of the value of b.
We questioned if the main part of error was due to our phantom design featuring a large off-axis volume of water and thermally suboptimal materials. Using repeated imaging of the phantom, we found, however, a good repeatability, suggesting acceptable thermal equilibrium.
SNR was also lower than QIBA's recommendation, but Malyarenko et al. [18] reported that low SNR has no impact on ADC assessments. Very low SNR without adequate postprocessing would probably alter measurements as most software (including the one we used) compute ADC images in thresholding/removing low intensity voxels.
We highlighted a correlation between ADC measurement error and the distance of ROIs from the magnetic centre. The error increased with bottom-shift (up to 24 % when located 8 cm out of isocentre). Conversely, with regard to lateral-shift we found no correlation with the magnitude of errors. This result can be explained by non-uniformity of gradient-fields.
As we found a correlation between variability of ADC assessments and contour segmentations, we concluded that partial volume effect was a major contributor to the variability.
A visual review of outliers in clinical data confirmed high variation of signal intensity in the tissue surrounding these lymph nodes. Consequently, even a small variation in segmentations led to a significant modification of ADC assessments (Fig. 6) We recommend measuring ADC by drawing ROIs smaller than the anatomical limits of the area of interest. How to optimize segmentation margins must be further investigated.
We found excellent repeatability and good reproducibility. This suggest that, if ADC intended to evaluate response to treatment, changes inferior to [-13 %; +15 %] may not be clinically relevant. However, longitudinal reproducibility would require further clinical studies to take into account all variability factors. According to our dataset, the averaged ADC value for healthy subject's cervical lymph nodes was 0.87.10 -3 ± 0.12 .10 -3 mm 2 /s.
Our results are well supported by the literature. Regarding the correlation between variability of ADC assessments and contour segmentation, heterogeneous segmentation methods are available but several studies documented the reproducibility issues [23][24][25] affecting these methods. These different approaches are also reported as cumbersome and time-consuming [26].
Specific phantom studies have shown that gradient-field error would be scanner-dependent [27] and not significant within 4 cm from the isocentre, explaining the good reproducibility of our ADC measurements and in other multicentric Outcome of the quality control after imaging the ICEWATER phantom. The test was done with different b values as b0-b100, b0-b600 and b0-b800. Tests not meeting QIBA quality claims are displayed in bold studies. On multiple scanners, measurements at 12 cm from the isocentre showed an average error of -20 % according to vertical shift and +7 % horizontally. Unlike our observation of an ADC value of 0.87.10 -3 ± 0.12 .10 -3 mm 2 /s in healthy subject's cervical lymph nodes, Kwee et al. [12] reported a range of [1.15 10 -3 mm 2 /s; 1.18 10 -3 mm 2 /s], with similar intra-and inter-observer variabilities. A review of 12 studies including more than 1,200 benign lymph nodes report ADC values of 0.302 ± 0.062.10 -3 mm 2 /s in inflammatory cervical nodes [28] and 2.38 ± 0.29.10 -3 mm 2 /s for abdominal nodes [29]. Kwee et al. concluded that disparity of results could be due to the various segmentation methods used.
Our results for non-diseased ADC values overlap with metastatic or lymphomatous lymph nodes ADC measures [0.410 ± 0.105 .10 -3 mm 2 /s; 1.84 ± 0.37 .10 -3 mm 2 /s] as reported by other groups [28,29]; however, the ADC values we found match with other non-diseased ADC studies. A radiologicalpathological correlation study by Vandecaveye et al. [30] on 331 cervical lymph nodes proposed an ADC threshold of 0.94 10 -3 mm 2 /s for detecting node malignancy featuring a specificity of 94 %. According to this threshold, 72 % of our data would have been misclassified. The use of ADC values to assess cervical lymph nodes malignancy does not reach a consensus.
Other variability factors are described in the literature.

1)
Inter-scanner variability. Some authors [18] report less than 3 % variability while others [31] conclude that 80 % of scanners featured less than 5 % error. Another group [32] reports that the CV R was 1.5 % on phantoms, and less than 4 % for cerebral parenchyma. Impact of acquisition parameters has also been investigated [33][34][35]. 2) Impacts of post-processing software were documented by Zeilinger et al. [36], reporting up to 8 % variation when ADC was processed by four different types of software. This limitation precludes longitudinal assessment of patients across different centers.
In order to address the limitations we found, variability can be minimized by standardizing ADC assessments. Efforts from scanner manufacturers are needed [8] to ease the Spatial variations of ADC measurements with respect to a reference VOI at center of the magnetic field. Top rows ADC measurements are shifted horizontally. Bottom rows ADC measurements are shifted to the bottom. In bold are the shift values corresponding to the distance from magnetic centre in cm.We found that ADC measurement did not change significantly when shifted right (Pearson coefficient=0.25). In opposite, ADC values increased when measurements were shifted vertically (Pearson coefficient=0.95) calibration of diffusion sequences. Also, initiatives from the scientific community and technological improvements are expected to avoid sequences artifacts and systematic errors. Correction of non-linear gradient fields, one of the relevant issues, is the focus of ongoing research [37,38].
Design of standardized validation methodologies and systematic quality control [39] are key to reach acceptable levels of compliance. To this end, a commercial version of diffusion phantom has been developed by QIBA [40] and an automatic quality control software is under development.
From early quantitative evaluation of diffusion, other approaches have emerged.
IVIM (Intra-Voxel Incoherent Motion) is aiming to split the different components of ADC [41]. Some authors indicate that pure molecular diffusion would be more meaningful than ADC as being independent from the perfusive component [42]. Therefore, IVIM would enable assessing tissue for which ADC is conceptually limited. Other studies develop alternative diffusion-derived QIBs in using the non-Gaussian distribution of diffusion kurtosis, Q-ball imaging and spectrum analysis in particular.
These approaches also require metrological analysis before scientific validation can be obtained, although the scientific literature shows potential added value of ADC.
The physiopathology of ADC changes still remains unknown. Some malignant tumours may feature increased ADC with respect to healthy tissue, either by spontaneous necrosis or cystic transformation, or by the destruction of a parenchyma with spontaneously low ADC. Some benign tumours may lead to ADC restriction; this is the case with Warthin tumours of salivary glands due to their cellular wealth, which is superior to the normal salivary tissue [43].
Lastly, microscopic changes triggered by anti-cancer treatments may interfere with ADC assessment follow-up [44] post-chemotherapy cytotoxic oedema leading to increased restriction of ADC despite a therapeutic response, and delayed fibrosis that may lead to suspicion of recurrence by dropping the ADC. Other confounding factors such as the appearance of extracellular oedema by hyper-hydration or by regional venous obstacles may also mislead the evaluation of therapeutic responses.
We found several limitations in our study. Phantom designs are not optimal and may have biased our results, limiting interpretations. First, regarding our ICEWATER phantom, even if the thermal equilibrium seemed satisfactory, a 1°C error could lead to a 2.4 % ADC error, explaining the slightly higher value found compared to QIBA recommendation. Second, with regard to the SPHERE phantom: (1) without To be noted Regarding right side column, unlike for other measurements, true size sphere and VOI have same size (10mm) because of the sampling limit (8 Voxel into the VOI). As a consequence, a nonlinear effect has been observed Decrease of the Relative error and the mean ADC value. This point is further developed in Table 5  thermal control, measurements were relative, limiting generalizability, (2) comparison of distant ROIs without gradient field correction was prone to bias, and (3) analysis of partial volume effect can be affected by phantom material, gas and features. Third, we did not evaluate the impact of artifacts [45]. QIBA recommends the use of corrective methods for liver and kidney analysis [8], although other groups [46,47] report no improvement in using such methods. Fourth, the monocentric design of our study limits the generalizability of our results. At this time, variabilities from different sources preclude a larger adoption of the ADC biomarker even though it is an important advance in cancer imaging. Generalizing quality controls and standardization of measurements is crucial to overcome these ADC variability issues.
Acknowledgements We acknowledge Catherine Klifa for her multiple reviews and scientific advices. All authors have equally contributed to this work.
Funding The authors state that this work has not received any funding.

Compliance with ethical standards
Guarantor The scientific guarantor of this publication is Dr. Bastien Moreau, PhD.
Conflict of interest Hubert Beaumont, as co-author of this manuscript, declares relationships with the following companies: Median Technologies.
All other authors of this manuscript declare no relationships with any companies whose products or services may be related to the subject matter of the article.
Statistics and biometry One of the authors has significant statistical expertise.
Informed consent Written informed consent was obtained from all subjects (patients) in this study.
Ethical approval Institutional Review Board approval was obtained.

Methodology
• prospective • experimental • performed at one institution Open Access This article is distributed under the terms of the Creative Comm ons Attribution 4.0 International License (http:// creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. Fig. 6 Example of inter-observer discordance in terms of volume and apparent diffusion coefficient (ADC). Example of inter-observer discordance in terms of volume and ADC on a level II lymph node. Top row First reader's measurements. Bottom row Second reader's measurements. Left b1000 diffusion maps where volumes of interest (VOIs) are drawn. Right corresponding ADC maps. Note: The heterogeneity of the node's environment featuring areas of high ADC values (green and red in the right images) without clear correspondence on the diffusion image