Whole-body MRI versus an [18F]FDG-PET/CT-based reference standard for early response assessment and restaging of paediatric Hodgkin’s lymphoma: a prospective multicentre study

Objectives To compare WB-MRI with an [18F]FDG-PET/CT-based reference for early response assessment and restaging in children with Hodgkin’s lymphoma (HL). Methods Fifty-one children (ages 10–17) with HL were included in this prospective, multicentre study. All participants underwent WB-MRI and [18F]FDG-PET/CT at early response assessment. Thirteen of the 51 patients also underwent both WB-MRI and [18F]FDG-PET/CT at restaging. Two radiologists independently evaluated all WB-MR images in two separate readings: without and with DWI. The [18F]FDG-PET/CT examinations were evaluated by a nuclear medicine physician. An expert panel assessed all discrepancies between WB-MRI and [18F]FDG-PET/CT to derive the [18F]FDG-PET/CT-based reference standard. Inter-observer agreement for WB-MRI was calculated using kappa statistics. Concordance, PPV, NPV, sensitivity and specificity for a correct assessment of the response between WB-MRI and the reference standard were calculated for both nodal and extra-nodal disease presence and total response evaluation. Results Inter-observer agreement of WB-MRI including DWI between both readers was moderate (κ 0.46–0.60). For early response assessment, WB-MRI DWI agreed with the reference standard in 33/51 patients (65%, 95% CI 51–77%) versus 15/51 (29%, 95% CI 19–43%) for WB-MRI without DWI. For restaging, WB-MRI including DWI agreed with the reference standard in 9/13 patients (69%, 95% CI 42–87%) versus 5/13 patients (38%, 95% CI 18–64%) for WB-MRI without DWI. Conclusions The addition of DWI to the WB-MRI protocol in early response assessment and restaging of paediatric HL improved agreement with the [18F]FDG-PET/CT-based reference standard. However, WB-MRI remained discordant in 30% of the patients compared to standard imaging for assessing residual disease presence. Key Points • Inter-observer agreement of WB-MRI including DWI between both readers was moderate for (early) response assessment of paediatric Hodgkin’s lymphoma. • The addition of DWI to the WB-MRI protocol in early response assessment and restaging of paediatric Hodgkin’s lymphoma improved agreement with the [18F]FDG-PET/CT-based reference standard. • WB-MRI including DWI agreed with the reference standard in respectively 65% and 69% of the patients for early response assessment and restaging. Supplementary Information The online version contains supplementary material available at 10.1007/s00330-021-08026-1.

• The addition of DWI to the WB-MRI protocol in early response assessment and restaging of paediatric Hodgkin's lymphoma improved agreement with the [18F]FDG-PET/CT-based reference standard. • WB-MRI including DWI agreed with the reference standard in respectively 65% and 69% of the patients for early response assessment and restaging.

Introduction
Paediatric Hodgkin's lymphoma is nowadays a highly curable malignancy and it is the most common type of cancer in adolescents [1,2]. Accurate assessment of response therapy, both after the first two courses of chemotherapy (early response assessment, ERA) and at the end of therapy (restaging), is of great importance for outcome prediction and tailoring the therapy schedule to the individual patient. [ 18 F]FDG-PET/CT is the current recommended imaging modality in international guidelines for (interim) response assessment [3]. A drawback of [ 18 F]FDG-PET/CT is the administration of ionizing radiation, especially if a high-dose rather than a lowdose CT is used [4]. Moreover, in many hospitals around the world, a contrast-enhanced CT is still part of standard clinical procedures [4]. In between the initial diagnosis and end of treatment, children diagnosed with Hodgkin's lymphoma undergo multiple imaging examinations and are thus exposed to a substantial dose of ionizing radiation. It is shown in the literature that children as compared to adults have an increased vulnerability to the long-term side effects of ionizing radiation such as secondary malignancies [5][6][7][8]. This underlines the need for careful administration of ionizing radiation in childhood, both for therapy purposes and in diagnostic imaging protocols.
Whole-body magnetic resonance imaging (WB-MRI) with diffusion-weighted imaging (DWI) was already shown to be a radiation-free alternative with high accuracy for staging paediatric Hodgkin's lymphoma [9][10][11][12][13]. The functional information provided by DWI has been found to be a useful addition to conventional WB-MRI protocols for staging Hodgkin's lymphoma [13], but despite these positive results for staging, the first studies focussing on response assessment do not show consistent results. Some authors have reported that WB-MRI cannot match [ 18 F]FDG -PET/CT for response assessment [10,[14][15][16], while others report that WB-MRI may be a useful alternative not only in staging but also in treatment response assessment in lymphoma patients [16][17][18][19]. Since most of these studies focus on heterogeneous lymphoma populations with different tumour characteristics and treatment protocols, drawing conclusions on the value of WB-MRI in paediatric Hodgkin's lymphoma is challenging [20].
Therefore, the aim of this prospective, multicentre study was to assess the performance of WB-MRI as compared to an [ 18 F]FDG-PET/CT-based reference standard in the detection of residual disease in early response assessment and restaging after completion of therapy in a study population comprising only paediatric Hodgkin's lymphoma patients.

Methods
For this prospective international multicentre cohort study, patients were recruited in nine hospitals: University Medical Centre Utrecht, University Children's Hospital Vall d'Hebron Barcelona, Amsterdam University Medical Centres, CHEO-Ottawa, Giannina Gaslini Children's Hospital Genova, Erasmus Medical Centre -Sophia Children's Hospital Rotterdam, Materno Infantile Burlo Garofolo Trieste and Oslo University Hospital Rikhospitalet. All institutional review boards provided approval of this prospective study. Written informed consent was obtained from all study participants and/or their parents or legal guardians.

Patient population
The patients included from the European participating hospitals also participated in the EuroNet PHL-C1 trial (First international Inter-Group Study for classical Hodgkin's Lymphoma in Children and Adolescents) [21,22]. Inclusion criteria for the current study were children aged 10-17 years old, with histologically proven Hodgkin's lymphoma. Exclusion criteria included general MRI contraindications such as pacemakers and claustrophobia. Patients were also excluded if the number of days between [ 18 F]FDG-PET/CT and WB-MRI exceeded 15 days for ERA or 30 days for restaging. All patients were included between 2012 and 2016 and staging results of the participants were reported in the previous article [13].

WB-MRI and [ 18 F]FDG-PET/CT procedures
All patients underwent both WB-MRI and [ 18 F]FDG-PET/CT examinations after two cycles of chemotherapy (ERA, performed 14-17 days after chemotherapy) ( Table 1). A subset of patients also received both imaging examinations after completion of therapy (restaging) (performed 14-17 days after completion of therapy). WB-MRI and [ 18 F]FDG-PET/CT were performed within a 15-day timeframe for ERA (median 0.0 days, interquartile range (IQR) −4 to 0.5) and a 30-day timeframe for restaging after therapy (median 0.0 days, IQR −1 to 3.5).
WB-MRI was performed using a 1.5-T system (Philips Healthcare) or Siemens or GE Medical Systems) or a 3.0-T system (Siemens) and image acquisition took place from the top of the head to the upper thigh. Coronal whole-body T1-weighted (T1W, except for CHEO, where only T2-weighted and diffusion-weighted images were acquired), T2-weighted images (T2W) and diffusion-weighted images (b0, b100 and b800 s/mm 2 ) were acquired under free-breathing, except for the stations covering the chest and abdomen, which were acquired using breath-holding (T1W) or respiratory triggering (T2W). Seamless coronal whole-body T1W and T2W images were created by merging separately acquired stations using software implemented in the standard operating console. Axial diffusion-weighted images were first coronally reformatted and then merged to create seamless coronal whole-body diffusion-weighted images. The duration of the examination, including patient preparation time, was approximately 60 min. The detailed WB-MRI parameters were as described before [13] and are shown in the supplementary material file. Approximately 60 min after the administration of [ 18 F]FDG, PET images were acquired according to the European Association of Nuclear Medicine (EANM) recommendations [23]. PET images were combined with low-dose CT (Biograph 16 PET-CT or Biograph 40 Truepoint PET-CT, Siemens Healthcare; Gemini TOF PET-CT or Allegro, Philips Healthcare). Imaging was performed from mid-thigh to skull base.

WB-MRI and [ 18 F]FDG-PET/CT interpretation
After completion of all examinations, the images were collected and de-identified and distributed to the readers. The WB-MRI scans were reviewed by two independent radiologists (R.A.J.N. and T.C.K. with 25 and 10 years of MRI experience in (paediatric) oncology, respectively) and the [ 18 F]FDG-PET/CT images were scored by a nuclear medicine physician (B.d.K. with 15 years of [ 18 F]FDG-PET/CT experience in (paediatric) oncology). None of the readers had access to information regarding clinical status and other imaging or laboratory findings except from the initial staging scans, to be able to distinguish pre-existent lesions from new disease presence. Either OsiriX Lite Medical Imaging Software (Pixmeo, www. osirix-viewer.com) or Horos Project Software (www. horosproject.org) was used.
For scoring, all readers used a standardised form that was based on the EuroNet PHL C1 trial [21,22]. The EuroNet PHL C1 trial is an international, multicentre, randomised controlled trial with the aims to reduce the use of radiotherapy in paediatric Hodgkin lymphoma and to compare different therapy strategies in patients with advanced or intermediate disease.
Residual disease presence was scored positive or negative for the predefined anatomical sites (10 nodal and all extranodal stations). Nodal stations were cervical, axillary, infraclavicular, mediastinal, pulmonary hilar, spleen, paraaortic, mesenteric, para-iliac and inguinal. The lymph node stations were considered negative for residual disease if all nodal lesions had regressed to ≤ 15 mm in the longest diameter [3]. The extra-nodal stations were considered negative for disease presence if no abnormal signal intensities or masses were found. Finally, response was reported as complete response (CR), partial response (PR), stable disease (SD) or progressive disease (PD) based on tumour diameters and [ 18 F]FDG uptake as defined by the EuroNet PHL C1 protocol [3,24]. For [ 18 F]FDG-PET/CT, the Deauville scoring system was used. This is a 5-point score, based on visual assessment of the [ 18 F]FDG uptake. At DWI, the presence of restricted diffusion was visually examined based on a high signal at high b-value DWI and a lower signal compared to muscles or the spinal cord at ADC [24].
The WB-MRI readers evaluated the images in two sets. First are the conventional sequences alone (T1-weighted and T2-weighted images) followed by the conventional sequences combined with DWI immediately thereafter. An independent paediatric radiologist (A.S.L., with 15 years of MRI experience in (paediatric) oncology) solved all discrepancies between both WB-MRI reviewers to form the consensus MRI datasets.
For [ 18 F]FDG-PET/CT, residual lesions were considered positive if their activity exceeded the uptake in the mediastinal blood pool [3]. The criteria for treatment response were based on those used in the EuroNet PHL C1 trial [24] in which the mediastinal blood pool was still used as reference. It should be noted that in newer versions of the EuroNet trial the uptake in the liver is used as a reference. For this current study, however, new lesions with an uptake above the mediastinal blood pool were also considered positive if no other explanation (e.g. inflammation) was more likely to be the underlying cause of the lesion.

Reference standard and intrinsic WB-MRI
All discrepancies between the scoring results from consensus WB-MRI including DWI and [ 18 F]FDG-PET/CT were discussed by an independent expert panel consisting of a nuclear medicine physician (N.T., with 9 years of [ 18 F]FDG-PET/CT experience in (paediatric) oncology) and a paediatric radiologist (A.S.L., with 15 years of MRI experience in (paediatric) oncology). The expert panel had access to all available data, including the results from clinical, histopathological and imaging examinations. For all discrepancies between WB-MRI including DWI and [ 18 F]FDG-PET/CT, the expert panel decided whether the discrepancy was a reader error or an intrinsic error. Intrinsic errors were caused by the limitations of the imaging modality itself (e.g. caused by artefacts). Reader errors could either be an interpretation error (incorrect classification of an abnormality) or a perceptual error (reader did not detect the abnormality). To form the [ 18 F]FDG-PET/CT-based reference standard, the expert panel corrected all reader and intrinsic errors from the [ 18 F]FDG-PET/CT reading. The intrinsic WB-MRI was formed by removing all reader errors from the consensus WB-MRI including DWI.
Agreement between WB-MRI without DWI, WB-MRI with DWI and intrinsic WB-MRI and the [ 18 F]FDG-PET/ CT-based reference standard was assessed by calculating total agreement, positive predictive value (PPV) and negative predictive value (NPV). Those were calculated between WB-MRI and the reference standard for lymphoma detection per patient (response classification). Sensitivity and specificity with corresponding 95% confidence intervals (CI) of WB-MRI without DWI, WB-MRI with DWI and intrinsic MRI for early response assessment and restaging were calculated against the reference standard. For all analyses in which multiple stations were assessed together, clustering within patients had to be considered. Multilevel analyses were performed as proposed by Vanbelle et al for the kappa statistics [26]. For observed agreement, PPV and NPV a mixed effect logistic regression model was used, taking clustering within patients into account using random intercepts.
All statistical analyses were performed using the Statistical Package for the Social Sciences (SPSS version 25.0,) or the R statistical software package version 3.5.1 (R development core team).

Patient population
Patient selection and inclusion were prospectively conducted between 2012 and 2016. A total of 54 patients were found eligible for inclusion. Three patients had to be excluded due to the following reasons: one patient had an incomplete MRI study, and in two patients, the time interval between WB-MRI and [ 18 F]FDG-PET/CT exceeded 15 days for early response assessment. All remaining 51 patients underwent WB-MRI and [ 18 F]FDG-PET/CT for early response assessment. A subset of 13 patients received both WB-MRI and [ 18 F]FDG-PET/ CT at restaging. Table 1 shows the patient characteristics, including age, gender, Hodgkin's lymphoma subtype, response classification, received therapy and number of patients included per participating centre. The majority of the patients (88%) received two cycles of OEPA (vincristine, etoposide, prednisone, Adriamycin (doxorubicin)) before early response assessment and most patients (39%) received 2 cycles of COPDAC (cyclophosphamide, doxorubicin, prednisone, dacarbazine) between early response assessment and restaging. Figure 1 shows a pie chart of the administered drug combinations, both before early response assessment and between early response assessment and restaging.

Inter-observer agreement WB-MRI including DWI
The inter-observer agreement between both WB-MRI readers is shown in Table 2

Early response assessment
The diagnostic value of WB-MRI without DWI, WB-MRI with DWI and intrinsic WB-MRI for early response assessment is presented in Table 3. At WB-MRI without DWI, a total of 47 out of 51 patients were diagnosed with a residual lesion, but in only 12 out of those 47 patients, the reference standard was positive for residual lesions. The overall observed agreement between WB-MRI without DWI and the reference standard was 29%. Sensitivity and specificity for determining response status were 92% and 8% respectively for WB-MRI without DWI. The addition of DWI improved overall agreement (65%) and specificity (68%), but sensitivity decreased to 54%. For the intrinsic WB-MRI, the observed agreement was 71% (36/51, 95% CI 57-81%). Sensitivity improved (77%) and specificity remained the same (68%) compared to the WB-MRI reading. For all WB-MRI datasets, the NPV exceeded the PPV (ranging from 26 to 45%), but none exceeded 90%. Table 4 shows the amount of true-positive, false-positive, false-negative and true-negative disease stations and observed agreement for WB-MRI compared to the reference standard. Observed agreements for both nodal and extra-nodal residual disease detection for early response assessment are  Fig. 2, an example of an intrinsic WB-MRI error in early response assessment is shown.

Restaging after completion of therapy
Only 13 patients were available for the restaging analysis. According to the reference standard, residual disease was present in one patient and one patient was diagnosed with progressive disease. Table 3 shows observed agreement, sensitivity, specificity and both PPV and NVP for restaging. The observed agreement between WB-MRI and the reference standard was 38% (5/13, 95% CI 18-64%) and improved with the addition of DWI to 69% (9/13, 95% CI 42-87%). In Table 4, the amount of true-positive, false-positive, false-negative and true-negative disease stations and observed agreement for WB-MRI including DWI compared to the reference standard are presented. Observed agreements for both nodal and extranodal residual disease detection for restaging are both high (97% and 100% respectively). Figure 3 shows an example of response evaluation at restaging.

Discussion
This multicentre, prospective, international study compared WB-MRI with an [ 18 F]FDG-PET/CT-based reference standard for detecting residual disease in early response assessment and restaging of paediatric Hodgkin's lymphoma.
Results demonstrate a moderate inter-observer agreement between WB-MRI readers. This is comparable to a previous study in which inter-observer agreement for WB-MRI including DWI was moderate for restaging after treatment as well [14]. For extra-nodal disease, it should be noted that the relatively low kappa values are caused by a very low amount of stations in which extra-nodal residual disease involvement was found. The observed agreement for detecting extranodal disease between both WB-MRI readers was 98%, however, which is probably more indicative of the agreement between both readers.
For the per-lesion analyses, high percentages of observed agreement between [ 18 F]FDG-PET/CT and WB-MRI including DWI (96-100%) were found for both nodal and extranodal detection of residual disease and for both early response assessment and restaging. However, especially for early response assessment, many false-positive stations were observed (around 4% of the total number of stations) compared to the reference standard. This indicates a low specificity for detecting nodal residual disease. This considerable number of false positives was in line with the expectations since it is known that residual masses are often detected without harbouring residual disease [27][28][29]. The additional value of DWI in detecting active disease was therefore analysed. With regard to the per patient analyses, the agreement  CI, confidence interval between WB-MRI and the [ 18 F]FDG-PET/CT-based reference standard improved with the addition of DWI for both early response assessment and restaging, but despite this major improvement, the agreements all remained insufficient with a maximum of 71% for early response assessment intrinsic WB-MRI. Similar conclusions can be drawn with regard to both sensitivity and specificity for detecting residual disease. This might be partly due to the amount of reader errors in the WB-MRI including DWI reading, as sensitivity and specificity both increased for the intrinsic WB-MRI, and also for the intrinsic WB-MRI data sensitivity and specificity remained inadequate (sensitivity 77% and specificity 68%).
This summary of our results points out that the value of WB-MRI for early response assessment and restaging is limited and cannot replace [ 18 F]FDG-PET/CT, mainly due to an underestimation of response to treatment. These results are comparable to recent literature in which the value of WB-MRI during early response assessment and/or restaging in all types of lymphoma was questioned [10,14,15]. However, the results of the current study are opposite to the results from Herrmann et al, Maggialetti et al and Mayerhoefer et al, who all reported that WB-MRI might be a useful alternative in response assessment of lymphomas [16][17][18]. These different outcomes might be due to differences in study populations including different lymphoma types. The For the present study, only Hodgkin's lymphoma patients were included, and therefore, our results might not be comparable to those earlier studies. [ 18 F]FDG-PET/MRI is being suggested in recent literature as an alternative to WB-MRI and [ 18 F]FDG-PET/CT offering the advantage of reduction of administered ionizing radiation since no CT is needed for attenuation correction and ability to lower the [ 18 F]FDG dose because of the long acquisition time of the MRI. However, the additional value of PET/MRI is still largely unclear [30,31].
Several limitations of our study need to be considered. First, since it is undesirable to obtain histopathological evidence of disease presence in all residual lesions, a true gold standard is not available for response assessment in Hodgkin's lymphoma. It should be taken into consideration that although [ 18 F]FDG-PET/CT is the recommended imaging modality in the international guidelines [3], both false-positive and falsenegative results do occur [32]. To undertake this lack of a true gold standard, the expert panel used all available data to form the [ 18 F]FDG-PET/CT-based reference standard. This use of all clinical, laboratory, follow-up and imaging data caused the reference standard and the index test not to be completely independent from each other. Although this consensus-based method does resemble clinical practice and was used in similar studies before [10,14], the differences between WB-MRI and the reference standard are probably somewhat underestimated due to this study design. Second, differences in MRI systems and field strengths were unavoidable due to the multicentre study design, and these differences combined with a long period of inclusion caused quality differences across the study data. This was most visible for the diffusion-weighted images and these differences might have caused bias in the DWI reading and perceptual errors. Furthermore, for this study, only visual assessment of DWI and apparent diffusion coefficient (ADC) images was used rather than the additional value of quantitative ADC measurements in improving the accuracy of WB-MRI. The lack of an optimal ADC cut-off value and the limited accuracy of ADC values for smaller lesions were reasons for the decision to use only DWI assessment only for this study [10,14,15,33]. Concerning the assessment of the DWI, it might have also been beneficial to use a larger variety of bvalues instead of just b0, b100 and b800. Given that was already shown in the recent literature that the accuracy for distinguishing between benign and malignant lymph nodes is not the same for all b-values, adding b200 would probably have been the most beneficial [34].
Third, the intrinsic WB-MRI dataset is probably an overestimation of reality, since reader errors are an unavoidable part of clinical practice. For restaging, no difference was found between WB-MRI including DWI and intrinsic WB-MRI, so the overestimation due to removing reader errors is likely limited. However, for early response assessment, three more patients agreed with the reference standard for intrinsic WB-MRI compared to WB-MRI with DWI; therefore, reader errors were of more influence in early response assessment.
Fourth, in line with the EuroNet protocol [24], only participants with residual disease (uptake above the mediastinal blood pool at [ 18 F]FDG-PET) at early response assessment underwent [ 18 F]FDG-PET/CT at restaging after completion of therapy, whereas all patients received a restaging WB-MRI. The number of patients that could be included for analysis of restaging data was, therefore, biased and limited to 13 patients, resulting in few residual lesions and limited opportunities for statistical analysis.
Finally, in the EuroNet PHL C1 protocol, determining whether or not the tumours have reduced by at least 50% is also part of early response assessment [24]. We did not implement that criterion in our study, but it could be argued that WB-MRI is viable for determining the percentage of decreased tumour volume and that it could be complementary to [ 18 F]FDG-PET/CT in that respect, since the CT is mostly low-dose in early response assessment. Furthermore, as mentioned in the "Methods" section, we used the EuroNet PHL C1 protocol in this study. The current EuroNet PHL study (C2) uses a study protocol that differs from the first, for instance by using a newer Deauville scoring system.
To conclude, the addition of DWI to the WB-MRI protocol in both early response assessment and restaging improved the accuracy in detecting residual disease as compared to an [ 18 F]FDG-PET/CT-based reference standard. However, WB-MRI remained discordant in 30% of the patients compared to standard imaging for assessing residual disease presence.
Funding This project was financially supported by the Stichting Kinderen Kankervrij (KiKa, project number 87). The data collection, analysis and interpretation of data, the writing of the paper and the decision to submit were not influenced by KiKa.

Compliance with ethical standards
Guarantor The scientific guarantor of this publication is R.A.J. Nievelstein.

Conflict of interest
The authors of this manuscript declare no relationships with any companies whose products or services may be related to the subject matter of the article.

Statistics and biometry
No complex statistical methods were necessary for this paper.
Informed consent Written informed consent was obtained for all included children and/or their parents or legal guardians Ethical approval Institutional Review Board approval was obtained.
Study subjects or cohorts overlap Staging results of the cohort described in this study were reported in our previous article: 'Whole-body MRI versus an FDG-PET/CT-based reference standard for staging of paediatric Hodgkin lymphoma: a prospective multicentre study.' Methodology • prospective • cross-sectional study / diagnostic study • multicenter study Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.