A comparison of FDG PET/MR and PET/CT for staging, response assessment, and prognostic imaging biomarkers in lymphoma

The aim of the current study was to investigate the diagnostic performance of FDG PET/MR compared to PET/CT in a patient cohort including Hodgkins lymphoma, diffuse large B-cell lymphoma, and high-grade B-cell lymphoma at baseline and response assessment. Sixty-one patients were examined with FDG PET/CT directly followed by PET/MR. Images were read by two pairs of nuclear medicine physicians and radiologists. Concordance for lymphoma involvement between PET/MR and the reference standard PET/CT was assessed at baseline and response assessment. Correlation of prognostic biomarkers Deauville score, criteria of response, SUVmax, SUVpeak, and MTV was performed between PET/MR and PET/CT. Baseline FDG PET/MR showed a sensitivity of 92.5% and a specificity 97.9% compared to the reference standard PET/CT (κ 0.91) for nodal sites. For extranodal sites, a sensitivity of 80.4% and a specificity of 99.5% were found (κ 0.84). Concordance in Ann Arbor was found in 57 of 61 patients (κ 0.92). Discrepancies were due to misclassification of region and not lesion detection. In response assessment, a sensitivity of 100% and a specificity 99.9% for all sites combined were found (κ 0.92). There was a perfect agreement on Deauville scores 4 and 5 and criteria of response between the two modalities. Intraclass correlation coefficient (ICC) for SUVmax, SUVpeak, and MTV values showed excellent reliability (ICC > 0.9). FDG PET/MR is a reliable alternative to PET/CT in this patient population, both in terms of lesion detection at baseline staging and response assessment, and for quantitative prognostic imaging biomarkers.


Introduction
The goal of treatment for Hodgkins lymphoma (HL), diffuse large B-cell lymphoma (DLBCL), and high-grade B-cell lymphoma is most often cure. Accurate staging at baseline and response assessment is therefore essential for optimal treatment strategies. Furthermore, risk stratification and treatment decisions require reliable prognostic imaging biomarkers in addition to pretreatment clinical risk assessment scores.
It is well established that functional imaging with positron emission tomography (PET)-computed tomography (CT) with 18 F-Fluorodeoxyglucose (FDG) is the standard imaging modality for FDG-avid lymphomas. Response adaptive FDG PET/CT guided treatment approach to evaluate chemosensitivity and to guide decision of radiotherapy allows for individual treatment strategies in HL [1,2]. Furthermore, end of treatment response FDG PET/CT for detecting residual disease in DLBCL is an important prognostic factor and guides indication for consolidation radiotherapy [3].
PET/magnetic resonance (MR) has the advantage of simultaneous PET and magnetic resonance imaging (MRI) data acquisitions, combining metabolic activity from PET with great soft tissue contrast and functional data from MRI with diffusion weighted imaging (DWI). In addition, reduction of the radiation dose by eliminating the contribution from CT is of value for the young patient.
Although FDG PET/MR has shown comparable ability to PET/CT for lesion detection and anatomical staging in terms of Ann Arbor in a few studies in adult lymphoma populations [4][5][6], these studies have included highly heterogeneous lymphoma populations acquired at different time points of imaging evaluation. Only one recent study [7] has assessed a large, homogenous lymphoma population (HL) at baseline but lacks response assessment evaluations.
To establish whether FDG PET/MR is a reliable alternative to PET/CT in the care of lymphoma patients, there is also a need to compare the quantitative PET metrics between the two modalities. PET-detector technology and attenuation correction of the PET images differs between PET/CT and PET/MR, which can have a significant impact on the quantitative PET measurements.
Baseline maximum standardized uptake value (SUVmax) was one of the first quantitative PET measurements used as a prognostic biomarker, and studies have found that SUVmax from FDG PET/MR versus PET/CT correlates well [4][5][6]8]. The role of baseline SUVmax in predicting treatment outcome is however discordant, and other quantitative PET metrics such as metabolic tumor volume (MTV) are reported in PET studies with increasing frequency. MTV is a promising and robust PET-based prognostic factor in HL [9,10] and DLBCL [11], but no studies have yet compared MTV from FDG PET/MR and PET/CT in lymphoma patients.
Deauville score and PET-based criteria of response are other strong prognostic factors used in FDG PET/CTbased response assessment that guides treatment decisions. Deauville score (5-point scale) is based on visual assessment and SUV in residual lymphoma lesions compared with SUV in mediastinal blood pool and liver [12]. PETbased criteria of response [13] use Deauville score and change in SUV from baseline. The correlation of Deauville score between the FDG PET/CT and PET/MR has only been evaluated in one pediatric HL study [14], where they found excellent agreement, while no studies has compared criteria of response.
The aim for this prospective study was to investigate the diagnostic performance of FDG PET/MR with DWI compared to today's standard PET/CT during first line treatment in a patient cohort including classical HL, DLBCL, and high-grade B-cell lymphoma. Primary endpoints were region-based and patient-based (Ann Arbor) agreement. Secondary endpoints were correlation of prognostic biomarkers in terms of Deauville score, criteria of response, SUVmax, SUVpeak, and MTV using FDG PET/CT as reference standard.

Study population
Patients were enrolled from the lymphoma section at St. Olavs Hospital, Trondheim University Hospital, from June 2016 to February 2019. Sixty-four patients fulfilled the following inclusion criteria: 18 years or older, histological confirmed DLBCL, classical HL, or high-grade B cell lymphoma. Exclusion criteria were contra-indications for MRI or pregnancy. Three patients were excluded due to missing PET/MR raw data at baseline. This left 61 patients included in the study (Table 1).
All 61 patients were imaged with FDG PET/CT directly followed by PET/MR at baseline. Interim (after 2 cycles of chemotherapy only for HL) and end of treatment (3-6 weeks after chemotherapy for both HL and aggressive non-Hodgkin lymhoma (NHL)) examinations were performed on a subgroup of the patients when PET/CT was clinically indicated. A total of 108 (61 baseline, 13 interim and 34 end of treatment) FDG PET/MR and PET/ CT examinations were therefore included in the study. The study was approved by the Regional Committee for Ethics in Medical Research (REK-Midt #2014/1289). All participants gave written informed consent before participation.

Image acquisition
PET/CT and PET/MR data was acquired by using a single intravenous injection of 18 F-FDG (4 MBq/kg). PET/CT and PET/MR were acquired 60 (median) minutes (range 59-64) and 100 (median) minutes (range 87-150) after injection, respectively. All patients fasted for at least 6 h before injection of 18 F-FDG, and blood glucose levels were measured prior to radiotracer administration. None of the patients had hyperglycemia (> 10 mmol/L).

PET/CT
PET/CT was acquired on a Siemens Biograph mCT (Simens Helthcare, Erlangen, Germany). Patients were examined with their arms up in 4-9 bed positions (depending on body height), 2.5-3 min per bed position (depending on body weight), covering top of skull to upper thighs. Non-contrastenhanced, low-dose CT with 120 kV, 0.5 s rotation time, pitch 0.95, and 40 mAS was performed for attenuation correction and morphological correlation.

PET/MR
A PET/MR system (Siemens Biograph mMR, Erlangen, Germany) was used for simultaneous PET and MRI acquisitions. Patients were examined with their arms down in 5 bed positions covering top of skull to upper thighs, 5 min for each bed position. Simultaneous MRI was acquired with the following MRI sequences: coronal T1 Dixon-Vibe, transversal diffusion-weighted MRI (DWI) (b values 50 and 800), transversal T2-HASTE, and coronal T2-TIRM. Breath-hold imaging was used for bed positions 2-4, covering the thorax and abdomen. Attenuation correction maps were calculated from the T1 Dixon-Vibe sequence, segmenting four tissue types (air, soft tissue, fat, and lung) into predefined linear attenuation coefficients.

PET reconstructions
PET image reconstruction was performed with iterative reconstruction (3D OSEM algorithm, 3 iterations, 21 subsets, and 4-mm Gaussian filter) with point spread function (PSF) and decay-, attenuation-, and scatter-correction. Timeof-flight was used on PET/CT but was not available on the PET/MR system. A 400 × 400 matrix was used on the PET/ CT while a 344 × 344 matrix was used on the PET/MR (this corresponds to a relatively similar pixel size on the two scanners).

Image analyses
The PET/CT and PET/MR images were read by two pairs of nuclear medicine physicians (7 and 24 years of experience) and radiologists (13 and 14 years of experience) using the same standardized reading protocols. The nuclear medicine physicians interpreted the PET images and the radiologists interpreted the CT/MRI images separately, followed by a joint report for the PET/CT and PET/MR by each reading team. The readers were blinded for clinical status, but aware of the histology. To avoid recollection, a period of 4 weeks was required between readings from different modalities. Baseline images were available when reading interim and end of treatment scans. In case of disagreement between joint report on PET/CT or PET/MR between the two reading teams, a final consensus was made by a third group consisting of a clinician with access to biopsy results, primary staging, and follow-up results and one reader from each reading team. Standard clinical software, Syngo.Via (Simens Healthineers) and AW server (GE Healthcare) were used.
Anatomical staging in terms of extent of lymphoma disease with the modified Ann Arbor staging system [13] was performed separately by a lymphoma oncologist based on the PET/CT and PET/MR readings in the study.

PET reading
The Lugano Classification [13] criteria for staging and response assessment were used for the PET reading. Diffuse uptake in the spleen without focal lesions had to be higher than 150% of SUV liver to be classified as diffusely involved [10]. For response assessment, Deauville score of 5 was defined as ≥3 times greater than liver. SUVmax and SUVpeak were recorded in all disease regions at baseline, and in response assessment, SUVmax was measured in the tumor with highest uptake.

Metabolic tumor volume
MTV was computed using the research software ACC URA-TE, a semi-automatic software tool for quantitative analysis of PET/CT [16]. Both nuclear medicine physicians segmented MTV independently on baseline PET/MR and PET/CT scans for the patients with aggressive NHL. Initially, an automated analysis was done with fixed SUV-threshold of 4.0 [17] before physiological uptake was excluded from the volume.

CT and MR reading
On CT and MRI, a lymph node > 15 mm in largest diameter on axial sequences was defined as pathological for lymphoma involvement. Morphological criteria for splenic involvement were focal lesions or craniocaudal diameter more than 13 cm on coronal CT or MRI. Bulky tumor was defined as ≥ 10 cm in largest diameter. The reading of different MRI sequences was performed simultaneously with no distinct order to combine morphological and structural information.

Statistical analyses
Inter-observer agreement between the two reading teams was assessed by kappa statistics for nominal categorical variables and weighted kappa score for ordinal categorical variables. Kappa values were indicative of poor (k < 0.2), fair (k 0.2-0.4), moderate (k > 0.4-0.6), good (k > 0.6-0.8), and excellent (k > 0.8) agreement [18]. The kappa values and the 95% confidence intervals were calculated for nodal and extranodal sites (lymphoma involvement or not) combined and disease stage (Ann Arbor I-II versus III-IV). Weighted kappa statistics were used for Deauville score (1-3, 4, or 5) and criteria of response (complete metabolic response, partial metabolic response, no metabolic response, or progressive metabolic response). Intraclass correlation coefficients (ICC) were used for inter-observer agreement for the continues variable MTV. ICC estimates less than 0.5, between 0.5 and 0.75, between 0.75 and 0.9, and greater than 0.9, which indicates poor, moderate, good, and excellent reliability, respectively [19]. Concordance for lymphoma involvement between consensus FDG PET/MR and the reference standard consensus PET/CT was assessed using kappa statistics, observed agreement, positive predictive values (PPV), negative predictive values (NPV), sensitivity and specificity at baseline, and response assessment scans (interim and end of treatment response). ICC estimates and their 95% confidence interval were calculated using absolute agreement and two-way mixed-effects model for continuous variables (SUVmax, SUVpeak, and MTV). ICC measurements showed no correlation (below 0.1) between measurements from multiple lesions in the same patient both on PET/CT and PET/MR; thus, no correction was performed when including measurements from multiple lesions in the same patients.
Additionally, Bland-Altman plots [20] were made to evaluate the bias between the mean differences of SUVmax and MTV measured on FDG PET/MR and PET/CT. The difference of SUVmax between PET/MR and PET/CT was not normally distributed, and a nonparametric approach was therefore used by employing the median and inter-quartile range to estimate the limits of agreement (LoA) (± 1.45 *IQR) in the Bland-Altman plot. The difference of MTV was normally distributed, and a parametric method was used in terms of the mean difference and the standard deviation to estimate the LoA (mean ± 1.96 SD of the difference) in the Bland-Altman plot. All statistical analyses were performed using SPSS version 26.0.

Consensus FDG PET/MR vs. PET/CT at baseline
When comparing consensus baseline FDG PET/MR with PET/CT as the reference standard (Table 3) (1)]. An example of a false-positive extranodal site on PET/MR is shown in Fig. 2, where PET/MR demonstrates a compelling radiological FDG avid lesion in pancreas that was scored as a paraaortic lymph node on PET/CT. Eleven extranodal sites were false negative on PET/MR compared with PET/CT [bone marrow (2), adrenal glands (2), GI tract (1), pancreas (1), thyroid (1), and soft tissue including pleura (4)]. Figure 3 shows an example of a false-negative lesion in the small intestine on PET/MR compared to a distinct FDG uptake on PET/CT, which was also histologically confirmed with biopsy from ileum.
We found concordance in disease stage (Ann Arbor 0-IV) in 57/61 patients (93%) with a weighted kappa value of κ 0.92 (95% CI 0.85-1). When comparing limited disease, I-II versus extended disease III-IV, 59/61 was staged similar on PET/MR and PET/CT. Table 4 shows consensus interim and end of treatment FDG PET/MR compared with PET/CT as the reference standard. Observed agreement for nodal sites combined was 798/799 (99.9%), with a sensitivity 100%, specificity 99.9%, PPV 87.5%, and NPV100%. The discrepant was a false-positive axillary node on PET/MR. For extranodal sites, the observed agreement was 751/752 (99.8%), sensitivity 100%, specificity 99.9%, PPV 80.0%, and NPV 100%. One PET/MR examination graded the bone marrow as false positive compared to PET/CT.

Consensus FDG PET/MR vs. PET/CT for response assessment
All of the 47 consensus interim and end of treatment examinations were scored similar on FDG PET/MR and PET/CT in terms of criteria of respons. Deauville score grading showed good agreement on weighted kappa, κ 0.72 (95% CI 0.54-0.89). Difference in Deauville score grading between PET/MR and PET/CT was seen in 11 patients and only in those with complete metabolic response (Deauville score 1-3). Figure 4 shows a patient with partial metabolic response and Deauville score 5 at end of treatment response on both image modalities, while the FDG avid lesion was scored as soft tissue on PET/CT but correctly identified in scapula on PET/MR.

Metabolic tumor volume at baseline
MTV was measured in 33 patients with DLBCL or highgrade B-cell lymphoma at baseline. Two patients were not included in this analysis due to no detectable disease on both modalities, and one was excluded because it was impossible to delineate lymphoma tissue form kidney and bladder on both PET modalities. ICC estimate and the 95% confidence intervals were 0.99 (0.98-1) for MTV, showing excellent reliability (p < 0.001). A Bland-Altmann plot of MTV (Fig. 6) showed a slightly higher MTV with PET/CT than

Discussion
In this prospective study, we have compared FDG PET/MR to today's standard PET/CT at baseline and response assessment in 61 patients with classical HL, DLBCL, or highgrade B-cell lymphoma.
Excellent kappa agreement was found for lymphoma staging of nodal sites combined when comparing consensus FDG PET/MR versus PET/CT with a sensitivity of 92.5% and a specificity of 97.9%. The nodal regions with most discrepancies were infraclavicular, hilar, and mediastinal lymph nodes. Previous studies comparing MRI versus FDG PET/CT for lymphoma staging have also found these nodal regions most challenging [21,22]. In all cases of discrepancies in the current study, both modalities showed the same FDG avid lesions, but the readers scored the lymph node region different between FDG PET/MR and   (Fig. 1). Motion artifacts and poor quality of the DWI sequences were reported in one of the misclassifications of hilar and mediastinal nodal regions. A few of the discrepancies can be explained by higher SUV value on FDG PET/CT than on PET/MR and vice versa and therefore classified as reactive lymph node on one of the modalities.
For staging of extranodal sites combined, excellent kappa agreement between consensus FDG PET/MR and PET/CT was also found, but with a lower overall sensitivity and specificity for PET/MR. The exstranodal regions with most discrepancies were pancreas, bone marrow, and soft tissue/ pleura. All, except one (Fig. 3), of the lesions were visible on both modalities, and the main reason for the discrepancies was different interpretation of location of the lesions, as illustrated in Fig. 2. Other lymphoma studies have found a higher concordance of extranodal disease between FDG Table 4 Consensus PET/MR versus consensus PET/CT response assessment (13 interim and 34 end of treatment scans) in terms of nodal and extranodal sites separate and combined. Deauville score and criteria of response *Skin, genitalia, brain, and bladder: no disease **Weighted kappa NA, not applicable due to no true positive lymphoma lesions on PET/CT and no false negative on PET/MR or one of the variables are constant.

Site
Observed  PET/MR and PET/CT [6,8,14]. However, compared to these studies, our patient population had a higher burden of extranodal disease, and in contrast to other study designs [22], we did not correct reader errors before the statistical analyses. This approach was chosen to reflect clinical imaging reading. Concordance in disease stage (Ann Arbor) was found in 57 of 61 patients. Three patients were understaged and one upstaged with FDG PET/MR versus PET/CT. The reason for under staging on FDG PET/MR was two patients with soft tissue or pleural involvement on PET/CT classified as nodal lesions on PET/MR, and one periaortic node with lower SUVmax on PET/MR and therefore interpreted as a reactive node. The discrepancy that led to upstaging on FDG PET/MR was a cervical lymph node with higher SUVmax on FDG PET/MR, interpreted as reactive node on PET/CT. In our cohort, this would have had clinical treatment consequences in one HL patient if staged with FDG PET/MR instead of PET/CT.
In response assessment scans, we found a sensitivity of 100% and a specificity of 99.9% for FDG PET/MR on both nodal and extranodal sites combined. Seven of the 43 response assessment scans had FDG avid disease present on both modalities. Among these patients, one patient with partial metabolic response on both examinations also had an axillary node that was graded positive on FDG PET/MR due to higher SUVmax than on PET/CT. In addition, one bone marrow was graded positive on FDG PET/MR and labeled as soft tissue on PET/CT (Fig. 4).
A recent study of pediatric HL patients found excellent agreement on Deauville score grading between FDG PET/MR and PET/CT [14]. Our study also showed a perfect agreement on Deauville score 4 and 5 between the two modalities. Differences were only found in patients with complete metabolic response (Deauville score 1-3) and would not have altered any treatment decisions. The reason for this difference is hard to postulate due to no observable trend in the results. In addition, all consensus response assessment scans were graded similar on FDG PET/MR and PET/CT in terms of criteria of respons, meaning that none of the included patients would have been treated any different based on FDG PET/MR response assessment examinations versus PET/CT.
When studying SUVmax and SUVpeak, both showed excellent reliability, but a slightly higher median SUVmax was found on FDG PET/MR compared to PET/CT. These findings may relate to the prolonged uptake time on PET/MR in our study (median 100 min after injection), as increased SUV is reported in lymphoma patients until 2 h after FDG injection [23]. Previous lymphoma studies acquiring PET/ CT before PET/MR has however demonstrated both higher SUVmax [5] and lower SUVmax [8,14] on PET/MR, indicating that the differences in SUV between the modalities are caused by other factors such as heterogeneity in the included patients in the studies or technical differences between the sites.
To our knowledge, this is the first study to compare MTV between FDG PET/CT and PET/MR in lymphoma patients. Our results showed excellent reliability in baseline MTV in 33 patients with aggressive NHL. MTV was slightly higher with FDG PET/CT than with PET/MR with a mean difference of 7.4 cm 3 . When considering the scale of MTV values in our population, both the mean difference and the LoA indicate good agreement among most of the patients and no systematic bias of clinical importance. There was no misclassification in attenuation correction maps or other external explanations in the images with the largest differences. MTV has previously been compared between FDG PET/ CT and PET/MR retrospectively in a lung and pancreatic cancer population [24] and the authors found that a threshold of SUV2.5 was more robust against imaging modality and protocol compared to SUV50%. Our study did however only use one method for calculating MTV, which could be a limitation. There is controversy regarding which method to use for calculating MTV, and ongoing work is done in order to standardize MTV measurements worldwide [25]. However, although the use of various contouring thresholds, such as SUV2.5, SUV4.0, and 41% of SUVmax, results in significantly different MTV values for the same PET-data, the various MTV methods have shown to predict prognosis with similar accuracy [25,26]. For that reason, only one method was used to calculate MTV in the current study. A recent study found that SUV4.0 was one of the two best methods for calculating MTV in lymphoma patients [17]. Together with our results, this study supports the use of SUV4.0 as a robust method for calculating MTV.
A notable limitation of our study is that all the FDG PET/ MR scans was performed after PET/CT. Future studies may scan half of the patients with PET/MR first to even out the difference in time of administration of FDG to time of imaging. Furthermore, the lack of a diagnostic contrast-enhanced CT for the radiologist to use for anatomical correlation when interpreting FDG PET/CT may have affected the interpretation of extranodal lesion localizations on the reference standard.
In conclusion, our study shows an overall excellent agreement between FDG PET/MR and PET/CT in terms of lesion detection at baseline staging and response assessment for adult patients with classical HL, DLBCL, and high-grade B-cell lymphoma. The discrepancies between the modalities were mainly due to misclassification of region and not due to lesion detection. The PET-based quantitative prognostic imaging biomarkers also showed good agreement between the two modalities, demonstrating that FDG PET/MR is a reliable alternative to PET/CT in this patient population.