Whole-body MRI versus an FDG-PET/CT-based reference standard for staging of paediatric Hodgkin lymphoma: a prospective multicentre study

Objectives
To assess the concordance of whole-body MRI (WB-MRI) and an FDG-PET/CT-based reference standard for the initial staging of children with Hodgkin lymphoma (HL).

Methods
Children with newly diagnosed HL were included in this prospective, multicentre, international study and underwent WB-MRI and FDG-PET/CT at staging. Two radiologists and a nuclear medicine physician independently evaluated all images. Discrepancies between WB-MRI and FDG-PET/CT were assessed by an expert panel. All FDG-PET/CT errors were corrected to derive the FDG-PET/CT-based reference standard. The expert panel corrected all reader errors in the WB-MRI dataset including diffusion-weighted imaging (DWI), referred to as WB-MRI DWI, to form the intrinsic WB-MRI dataset. Inter-observer agreement for WB-MRI DWI was calculated using overall agreement, specific agreements and kappa statistics. The primary outcome was the concordance between WB-MRI (without DWI, with DWI, and intrinsic WB-MRI DWI) and the reference standard for correct classification of all disease sites and of disease stage. Secondary outcomes included positive predictive value, negative predictive value and kappa statistics. Clustering within patients was accounted for using a mixed-effect logistic regression model with random intercepts and a multilevel kappa analysis.

Results
Sixty-eight children were included. Inter-observer agreement between WB-MRI DWI readers was good for disease stage (κ = 0.74). WB-MRI DWI agreed with the FDG-PET/CT-based reference standard for determining disease stage in 96% of the patients, versus 88% for WB-MRI without DWI. Agreement between WB-MRI DWI and the reference standard was excellent for both nodal (98%) and extra-nodal (100%) staging.

Conclusions
WB-MRI DWI showed excellent agreement with the FDG-PET/CT-based reference standard. The addition of DWI to the WB-MRI protocol improved the staging agreement.

Key Points
• This study showed excellent agreement between WB-MRI DWI and an FDG-PET/CT-based reference standard for staging paediatric HL.
• Diffusion-weighted imaging is a useful addition to WB-MRI in staging paediatric HL.
• Inter-observer agreement for WB-MRI DWI was good for both nodal and extra-nodal staging and for determining disease stage.

Electronic supplementary material: The online version of this article (10.1007/s00330-020-07182-0) contains supplementary material, which is available to authorized users.


Introduction
Hodgkin lymphoma (HL) is amongst the most prevalent childhood cancers and is the most common type of cancer in adolescents [1]. After diagnosis, determining the extent of disease (staging) is important for the choice of treatment. The Lugano staging system used for HL distinguishes four disease stages, each with (B) or without (A) systemic symptoms (B-symptoms), and with or without E-lesions (E, extra-nodal extension) [2,3]. Standard treatment consists of chemotherapy and radiotherapy. Limited-stage disease requires less treatment than advanced-stage disease, and radiotherapy can be omitted based on the response measured with 18F-fluorodeoxyglucose positron emission tomography (FDG-PET)/computed tomography (CT). FDG-PET/CT is currently considered the reference standard imaging modality for staging HL [2,4,5]. Unfortunately, FDG-PET/CT involves exposure to ionizing radiation. Overall survival rates in paediatric HL are around 95% [1,6]. Children with HL therefore generally have a long life expectancy after treatment, which implies a long time frame in which late side effects of radiation exposure during diagnosis and treatment can occur. At the University Medical Center Utrecht, the administered ionizing radiation dose is 5 millisievert (mSv) per FDG-PET/CT, although the actual dose depends on whether a low-dose CT or a high-dose contrast-enhanced CT is used. In other centres, reported ionizing radiation doses are higher, up to 23 ± 11 mSv per FDG-PET/CT, especially because contrast-enhanced CT is still part of standard procedures in many hospitals [7]. Because repeated imaging is required during staging and follow-up, the radiation dose accumulates to even higher levels. Combined with the increased susceptibility of children to the effects of ionizing radiation [8], this puts children with HL at risk of developing secondary malignancies later in life [9][10][11][12][13][14].
Whole-body magnetic resonance imaging with diffusion-weighted imaging (WB-MRI with DWI) is a radiation-free method that images the whole body with excellent soft-tissue contrast in a single examination and could therefore be an attractive alternative to FDG-PET/CT for staging HL in children [15][16][17][18]. It has been suggested that adding DWI to WB-MRI protocols provides functional as well as anatomical information, offering a possible surrogate for the functional information provided by FDG-PET/CT [19]. The evidence for the use of WB-MRI with DWI for staging HL in children, although increasing, is still limited [16,17,20,21]. The aim of this study was to compare the concordance of WB-MRI (including DWI) and FDG-PET/CT for initial staging in children with HL, in order to contribute to the development of evidence-based 'radiation-reduced' imaging protocols in paediatric HL.

Study population
All European patients were included in the Euronet PHL-C1 trial (First International Inter-group Study for Classical Hodgkin's Lymphoma in Children and Adolescents) [22,23]. Inclusion criteria were age 7-18 years and newly diagnosed, histologically proven HL. All patients were included between March 2012 and January 2016. Exclusion criteria were general contraindications for MRI (e.g. pacemaker, metallic implant and claustrophobia), previous malignancies, and breastfeeding or pregnancy.

Procedures
Patients underwent both FDG-PET/CT and WB-MRI at staging, before the start of treatment. FDG-PET/CT was performed as part of standard clinical care, and WB-MRI was always performed within 15 days of the FDG-PET/CT (median 1.00 day, interquartile range (IQR) 4.00 days). Full descriptions of the WB-MRI and FDG-PET/CT protocols used by all participating centres are provided in the supplementary material. WB-MRI sequence parameters are shown in Supplementary Table 1. All images were de-identified and collected for review.

Whole-body MRI image interpretation
The de-identified WB-MR images were analysed by two independent radiologists (R.A.J.N. and T.C.K., with 25 and 10 years of MRI experience, respectively) using OsiriX Lite Medical Imaging Software (Pixmeo) or Horos (Horos Project). The readers were aware of the HL diagnosis but had no access to other information such as clinical data and other imaging findings. Analyses and scoring were performed using a standardized form based on the Euronet PHL-C1 trial [22,23]. The readers first evaluated the WB-MRI without DWI (T1-weighted and T2-weighted sequences only) and immediately thereafter the WB-MRI including DWI. Disease presence was scored for 10 nodal regions and all possible extra-nodal regions (e.g. thoracic, abdominal, central nervous system and musculoskeletal sites). The nodal regions were cervical, axillary, infraclavicular, mediastinal, hilar, spleen, para-aortic, mesenteric, para-iliac and inguinal; the relevant extra-nodal regions were lung, liver and bone marrow. Table 1 summarizes the criteria for involvement of the different nodal and extra-nodal regions. Finally, the disease stage was determined for each reading [24]. Discrepancies between the datasets of the two WB-MRI readers were resolved by a third reader (A.S.L., a radiologist with 15 years of MRI experience) to form the final consensus WB-MRI datasets (both with and without DWI).
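The step from station-level scores to a disease stage can be illustrated with a minimal Python sketch. It assumes a simplified Ann Arbor-type rule set (stage 1: one region; stage 2: two or more regions on one side of the diaphragm; stage 3: regions on both sides, with the spleen counted below the diaphragm; stage 4: any lung, liver or bone marrow involvement). These rules, the station sets and the function name `simplified_stage` are illustrative assumptions, not the Euronet PHL-C1 scoring form; E-lesions and B-symptoms are not modelled.

```python
# Hypothetical grouping of the scored sites, for illustration only.
ABOVE_DIAPHRAGM = {"cervical", "axillary", "infraclavicular", "mediastinal", "hilar"}
BELOW_DIAPHRAGM = {"spleen", "para-aortic", "mesenteric", "para-iliac", "inguinal"}
DISSEMINATED = {"lung", "liver", "bone marrow"}  # simplification: always stage 4


def simplified_stage(positive_sites: set) -> int:
    """Simplified Ann Arbor-type stage (0-4) from the set of positive sites.

    Assumed rules, not the Euronet PHL-C1 form: E-lesions and B-symptoms are
    ignored, and 0 means that no site was scored positive.
    """
    sites = set(positive_sites)
    unknown = sites - ABOVE_DIAPHRAGM - BELOW_DIAPHRAGM - DISSEMINATED
    if unknown:
        raise ValueError(f"unknown sites: {sorted(unknown)}")
    if sites & DISSEMINATED:
        return 4
    above = sites & ABOVE_DIAPHRAGM
    below = sites & BELOW_DIAPHRAGM
    if above and below:
        return 3  # disease on both sides of the diaphragm
    n_regions = len(above | below)
    if n_regions == 0:
        return 0
    return 1 if n_regions == 1 else 2


# A positive spleen call on top of supradiaphragmatic disease changes stage 2 to 3.
print(simplified_stage({"cervical", "mediastinal"}))            # 2
print(simplified_stage({"cervical", "mediastinal", "spleen"}))  # 3
```

The usage lines show why a single station call can matter clinically: under these assumed rules, one extra positive site below the diaphragm is enough to move a patient from limited to advanced stage.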

FDG-PET/CT image interpretation
The de-identified FDG-PET/CT images were analysed by a nuclear medicine physician (B.d.K., with 15 years of FDG-PET/CT experience) using OsiriX Lite Medical Imaging Software. The reader was blinded to clinical data and to imaging findings not related to the lymphoma diagnosis. Disease presence was scored as either positive (e.g. FDG uptake above that of the mediastinum and/or liver) or negative for the 10 nodal and all extra-nodal stations, and the disease stage was reported [24].

Expert panel: forming the reference standard and intrinsic WB-MRI dataset
An independent expert panel reviewed all discrepancies between the consensus WB-MRI and FDG-PET/CT scoring results. The panel consisted of a radiologist (A.S.L.) and a nuclear medicine physician (N.T., with 9 years of FDG-PET/CT experience) and had access to all available clinical and imaging information. Each discrepancy was labelled as either a reader error or an intrinsic error. Reader errors were caused either by failure of the reader to detect an abnormality (perceptual error) or by incorrect interpretation of an abnormal finding (interpretation error). Intrinsic errors were due to limitations of the image acquisition or technique (e.g. an abnormality outside the field of view or an error caused by severe artefacts). All reader errors in the WB-MRI including DWI were corrected to form the intrinsic WB-MRI dataset. The FDG-PET/CT-based reference standard was formed by correcting all FDG-PET/CT reader and intrinsic errors.
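The correction rules described above can be expressed as a minimal sketch for a single disease site. The `SiteReading` record and both functions are hypothetical; the sketch assumes, as the text states, that only discrepant sites were reviewed by the panel, so concordant calls are kept unchanged in both datasets.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class SiteReading:
    """Hypothetical record for one disease site in one patient."""
    mri: bool                        # consensus WB-MRI (with DWI) call
    pet: bool                        # FDG-PET/CT call
    panel: Optional[bool] = None     # expert-panel verdict, discrepant sites only
    mri_error: Optional[str] = None  # "reader" or "intrinsic" if the MRI call was wrong


def reference_standard(site: SiteReading) -> bool:
    """FDG-PET/CT call with every FDG-PET/CT error corrected by the panel."""
    if site.mri == site.pet:
        return site.pet  # concordant sites were not reviewed
    if site.panel is None:
        raise ValueError("discrepant site lacks a panel verdict")
    return site.panel


def intrinsic_mri(site: SiteReading) -> bool:
    """WB-MRI call with reader errors corrected; intrinsic errors are kept."""
    if site.mri == site.pet:
        return site.mri
    if site.panel is None:
        raise ValueError("discrepant site lacks a panel verdict")
    if site.mri != site.panel and site.mri_error == "reader":
        return site.panel  # perceptual or interpretation error: corrected
    return site.mri        # correct call, or an intrinsic error that stays


site = SiteReading(mri=False, pet=True, panel=True, mri_error="reader")
print(reference_standard(site), intrinsic_mri(site))  # True True
```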

Statistical analysis
The statistical analyses were performed using the Statistical Package for the Social Sciences (SPSS), version 25.0, and the R statistical software package version 3.5.1 (R Development Core Team).
Concordance of WB-MRI without DWI, WB-MRI with DWI and intrinsic WB-MRI with the FDG-PET/CT-based reference standard was assessed by calculating total agreement, positive predictive values (PPV), negative predictive values (NPV) and Cohen's kappa statistics. These measures were calculated per patient for disease stage and for the presence or absence of disease in the separate nodal and extra-nodal stations, as well as for the combined nodal and extra-nodal stations. To determine the additional value of DWI, kappa values for staging agreement with and without DWI were compared and tested as proposed by Vanbelle [25]. A p value < 0.05 was considered statistically significant.
Sensitivity and specificity of WB-MRI without DWI, WB-MRI with DWI and intrinsic WB-MRI for staging were calculated against the reference standard.
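The measures listed above all derive from a 2x2 table of WB-MRI calls against the reference standard. A minimal sketch, using the standard definitions of PPV, NPV, sensitivity, specificity and Cohen's kappa (κ = (p_o - p_e)/(1 - p_e), where p_e is the chance agreement from the marginal totals). The function name and the example data are hypothetical, and this plain version treats every observation as independent; it does not reproduce the multilevel adjustment for clustering within patients that the authors used.

```python
from typing import Dict, Sequence


def diagnostic_summary(test: Sequence[bool], reference: Sequence[bool]) -> Dict[str, float]:
    """Agreement measures for paired binary calls (True = disease present)."""
    if len(test) != len(reference) or not test:
        raise ValueError("need equally long, non-empty paired sequences")
    pairs = list(zip(test, reference))
    tp = sum(1 for t, r in pairs if t and r)
    fp = sum(1 for t, r in pairs if t and not r)
    fn = sum(1 for t, r in pairs if not t and r)
    tn = sum(1 for t, r in pairs if not t and not r)
    n = len(pairs)

    def ratio(num: int, den: int) -> float:
        # Undefined ratios (empty denominator) are reported as NaN.
        return num / den if den else float("nan")

    p_obs = (tp + tn) / n
    # Chance agreement expected from the marginal totals of both ratings.
    p_exp = ((tp + fp) * (tp + fn) + (fn + tn) * (fp + tn)) / n ** 2
    kappa = (p_obs - p_exp) / (1 - p_exp) if p_exp < 1 else float("nan")
    return {
        "agreement": p_obs,
        "ppv": ratio(tp, tp + fp),
        "npv": ratio(tn, tn + fn),
        "sensitivity": ratio(tp, tp + fn),
        "specificity": ratio(tn, tn + fp),
        "kappa": kappa,
    }


# Hypothetical example: 8 TP, 2 FP, 1 FN and 9 TN observations.
wbmri_calls = [True] * 10 + [False] * 10
reference_calls = [True] * 8 + [False] * 2 + [True] + [False] * 9
summary = diagnostic_summary(wbmri_calls, reference_calls)
print({k: round(v, 3) for k, v in summary.items()})
# {'agreement': 0.85, 'ppv': 0.8, 'npv': 0.9, 'sensitivity': 0.889, 'specificity': 0.818, 'kappa': 0.7}
```

In the example, raw agreement is 85%, but half of it (p_e = 0.5) would be expected by chance, which is why kappa is lower (0.70).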
Inter-observer agreement between the two WB-MRI readers was assessed using percentages of observed and specific agreement (expressing the agreements for positive and negative ratings separately) and kappa statistics [26].
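Specific agreement separates agreement on positive ratings from agreement on negative ratings. A minimal sketch using the usual two-reader definitions (positive specific agreement 2a/(2a + b + c) and negative specific agreement 2d/(2d + b + c), where a and d count concordant positive and negative ratings and b + c the discordant ones). The function name and the example counts are hypothetical.

```python
from typing import Sequence, Tuple


def specific_agreement(reader1: Sequence[bool], reader2: Sequence[bool]) -> Tuple[float, float, float]:
    """Return (observed, positive specific, negative specific) agreement."""
    if len(reader1) != len(reader2) or not reader1:
        raise ValueError("need equally long, non-empty paired sequences")
    pairs = list(zip(reader1, reader2))
    both_pos = sum(1 for x, y in pairs if x and y)          # a
    both_neg = sum(1 for x, y in pairs if not x and not y)  # d
    discordant = len(pairs) - both_pos - both_neg           # b + c

    def ratio(num: int, den: int) -> float:
        return num / den if den else float("nan")

    observed = (both_pos + both_neg) / len(pairs)
    positive = ratio(2 * both_pos, 2 * both_pos + discordant)
    negative = ratio(2 * both_neg, 2 * both_neg + discordant)
    return observed, positive, negative


# Hypothetical counts: a = 10, b = 4, c = 4, d = 50 (68 ratings in total).
reader_1 = [True] * 14 + [False] * 54
reader_2 = [True] * 10 + [False] * 4 + [True] * 4 + [False] * 50
print(tuple(round(v, 3) for v in specific_agreement(reader_1, reader_2)))
# (0.882, 0.714, 0.926)
```

With the same eight disagreements, positive specific agreement is lower than negative specific agreement simply because positive ratings are rarer, which is the pattern the readers showed in this study.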
For all analyses of the combined nodal and extra-nodal stations, clustering within patients had to be taken into account. For the kappa statistics, multilevel analyses were performed as proposed by Vanbelle et al [25]. For observed agreement, PPV and NPV, a mixed-effect logistic regression model with random intercepts was used to account for clustering within patients.

Patient characteristics
Seventy-six patients were found eligible and were prospectively included between 2012 and 2016. Eight patients were excluded because of missing informed consent (n = 1), an incomplete MRI study (n = 2) or logistical reasons (n = 5: patient too sick to undergo WB-MRI (n = 2), no definite HL diagnosis at the scheduled time of WB-MRI (n = 1), and WB-MRI scheduled at the same time as another examination (n = 2)). All remaining 68 patients underwent WB-MRI and FDG-PET/CT for staging. The baseline characteristics, including age, gender, HL subtype and disease stage, are shown in Table 2.
Table 3 summarizes the inter-observer agreement between the two WB-MRI readers. Overall agreement between readers for disease stage (limited versus advanced disease) was 88% (60/68, 95% CI 0.78-0.93), and the kappa agreement was good (κ = 0.74, 95% CI 0.58-0.91). Specific agreement on positive ratings was generally lower than specific agreement on negative ratings, indicating that readers were more likely to agree on a negative than on a positive rating for disease presence. The lymph node stations with the lowest agreement were hilar (74%, 50/68) and infraclavicular (77%, 52/68). For all other stations, the observed agreement was ≥ 90%.
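The text does not state how the confidence intervals were computed. As one common choice, a Wilson score interval for the inter-reader agreement on disease stage (60/68) can be sketched with the Python standard library. The method and the function name `wilson_interval` are assumptions, and the upper bound this gives (0.94) differs slightly from the reported 0.93, so the authors probably used a different method.

```python
from math import sqrt
from statistics import NormalDist


def wilson_interval(successes: int, n: int, confidence: float = 0.95):
    """Wilson score confidence interval for a binomial proportion."""
    if n <= 0 or not 0 <= successes <= n:
        raise ValueError("need n > 0 and 0 <= successes <= n")
    z = NormalDist().inv_cdf(0.5 + confidence / 2)  # about 1.96 for 95%
    p = successes / n
    denom = 1 + z ** 2 / n
    centre = (p + z ** 2 / (2 * n)) / denom
    half_width = z * sqrt(p * (1 - p) / n + z ** 2 / (4 * n ** 2)) / denom
    return centre - half_width, centre + half_width


lo, hi = wilson_interval(60, 68)
print(f"{60 / 68:.2f} (95% CI {lo:.2f}-{hi:.2f})")  # 0.88 (95% CI 0.78-0.94)
```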

Reference standard and intrinsic WB-MRI data
The expert panel identified a total of 43 discrepant disease sites (5% of all examined disease sites) between consensus WB-MRI and FDG-PET/CT (31 nodal and 12 extra-nodal sites). Twenty-three FDG-PET/CT reader errors in 17 patients were corrected (19 perception errors and 4 interpretation errors); no intrinsic FDG-PET/CT errors were identified. Figure 1 shows an example of an interpretation error by the FDG-PET/CT reader. To obtain the intrinsic WB-MRI dataset, the expert panel identified and corrected 9 WB-MRI reader errors in 8 patients: 8 perception errors and 1 interpretation error. The perception errors were located in the following stations: spleen (3 patients), axillary, hilar, para-iliac, para-aortic and mediastinal. The single interpretation error concerned a cervical lesion that was misinterpreted because of the placement of a central venous catheter, clinical information that was not available to the reader (Fig. 2).
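As a quick consistency check, the reported proportion of discrepant sites can be recomputed. This sketch assumes that each patient contributed 13 scored sites (the 10 nodal stations plus lung, liver and bone marrow as listed in the methods); the true denominator may differ, because all possible extra-nodal regions were scored.

```python
patients = 68
nodal_stations = 10
extranodal_sites = 3  # assumption: lung, liver and bone marrow only
examined_sites = patients * (nodal_stations + extranodal_sites)
discrepant_sites = 31 + 12  # nodal + extra-nodal discrepancies
print(discrepant_sites, examined_sites, f"{discrepant_sites / examined_sites:.1%}")
# 43 884 4.9%
```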

Consensus WB-MRI versus FDG-PET/CT-based reference standard
The additional value of DWI to T1-weighted and T2-weighted images in staging paediatric HL was assessed by comparing the consensus WB-MRI datasets with and without DWI to the FDG-PET/CT-based reference standard. Comparing the datasets with and without DWI revealed discrepancies in 9 patients (Table 5). These differences were found in the following stations: hilar (3 patients), para-aortic (2 patients), mesenteric (2 patients), liver (1 patient) and bone marrow (2 patients) (Fig. 3). With the addition of DWI, staging agreement with the FDG-PET/CT-based reference standard improved in 5 of these 9 patients. In these 5 patients, WB-MRI without DWI would have resulted in upstaging in three patients and in downstaging in two patients. Sensitivity and specificity for staging paediatric HL using consensus WB-MRI without DWI were 96% (95% CI 0.78-1.00) and 89% (95% CI 0.76-0.96), respectively; for WB-MRI including DWI, sensitivity and specificity increased to 100% (95% CI 0.85-1.00) and 96% (95% CI 0.85-0.99), respectively.

Discussion
This prospective, multicentre, international study in 68 children with newly diagnosed Hodgkin lymphoma compared the concordance of WB-MRI with and without DWI to an FDG-PET/CT-based reference standard for the initial staging of paediatric HL.
The results showed good inter-observer agreement between the WB-MRI readers for both nodal and extra-nodal staging, comparable to the agreement found in previous studies [16,17]. The lymph node stations with the most discrepancies between WB-MRI DWI readers were infraclavicular, hilar and mesenteric. These discrepancies were mostly due to labelling errors: mesenteric lymph nodes were scored as para-aortic, infraclavicular lymph nodes were sometimes mistaken for cervical lymph nodes, and hilar lymph nodes were marked as mediastinal or vice versa. In most cases, these labelling errors did not affect the determination of disease stage. Although motion artefacts (mainly cardiac or respiratory) were present in some of the WB-MRI scans, these artefacts did not cause (labelling) errors.
For disease stage, agreement between the reference standard and the intrinsic WB-MRI was 97.1%. This sounds promising, but in the two discrepant cases the difference would have had relevant implications for treatment planning in clinical practice. On WB-MRI, both patients would have been classified as stage 3, implying advanced disease and thus a more intensive treatment scheme, whereas the FDG-PET/CT-based reference standard classified both patients as stage 2, which, given the absence of B-symptoms in both patients, is considered limited disease warranting a less intensive treatment regimen. In both cases, the discrepancy was caused by splenic enlargement without increased FDG uptake on FDG-PET/CT. The size criterion used for WB-MRI thus caused inaccurate detection of splenic involvement. In clinical practice, most patients also undergo an ultrasound examination at first presentation, which provides additional information on splenic involvement; this information was not considered in this study. With the addition of DWI to the WB-MRI reading, agreement on disease stage improved for five patients (Table 5). This difference in agreement on disease stage was statistically significant (p = 0.036).
The concordance between intrinsic WB-MRI DWI and the FDG-PET/CT-based reference standard was 100% for extra-nodal disease and 99% for nodal disease. These agreements resemble those in recent literature: Latifoltojar et al reported 99% concordance for nodal disease and > 99% for extra-nodal disease for their WB-MRI reading after removal of perceptual errors [17]. The 100% overall agreement for extra-nodal disease found for both WB-MRI DWI and intrinsic WB-MRI DWI indicates that no lung lesions were missed by WB-MRI DWI. This matters because detection of lung lesions is currently the main reason for performing a separate CT examination, which adds exposure to ionizing radiation.
When the separate stations were assessed, agreement with the reference standard was good to excellent for all stations, for WB-MRI both with and without DWI. With the addition of DWI, agreement with the reference standard remained the same or improved for all stations. The main improvements were seen for the hilar (κ 0.88 to κ 0.94), para-aortic (κ 0.81 to κ 0.88), mesenteric (κ 0.82 to κ 1.00) and bone marrow (κ 0.88 to κ 1.00) stations. Therefore, in line with our previous results, DWI was mainly of additional value in the abdominal lymph node stations [28].
This study has a few limitations that need to be addressed. An unblinded expert panel created the FDG-PET/CT-based reference standard in a consensus reading with access to all collected data. Because no true gold standard was available, this was the best available option for forming a reference standard; the same approach has been used by others [16,17,29]. Since the expert panel used all available data, the WB-MRI data and the reference standard were not completely independent of each other. However, this method does resemble clinical practice, in which final decisions are made in consensus. Because of this design, the differences between WB-MRI and the reference standard in this study might be underestimated.
Furthermore, all reader errors were removed from the WB-MRI DWI reading to create the intrinsic WB-MRI dataset. Although this probably provides the best achievable WB-MRI results, reader errors are part of daily clinical practice, so the intrinsic WB-MRI results likely overestimate real-world performance. However, it can also be argued that the intrinsic WB-MRI resembles clinical practice as closely as possible, since in clinical practice all patients are discussed in multidisciplinary meetings where clinical, histological and laboratory findings are considered alongside the imaging results. The intrinsic WB-MRI showed only small increases in agreement compared with WB-MRI including DWI, so the overestimation of intrinsic WB-MRI accuracy seems limited. In contrast to the WB-MRI reading, the FDG-PET/CT reading was performed by only one experienced reader, so inter-observer agreement could not be determined for FDG-PET/CT. This limitation was addressed by the expert panel (including a nuclear medicine physician), which created the FDG-PET/CT-based reference standard by assessing all available information.
Finally, this study focused on the initial staging of paediatric HL. During diagnosis, treatment and follow-up, children with HL undergo multiple imaging examinations. As a consequence, the administered ionizing radiation can accumulate to significant levels, especially since contrast-enhanced CT (CE-CT) is still widely used for imaging paediatric HL. The performance of WB-MRI in other phases of the disease, such as response evaluation and restaging, might differ from its performance for staging. Although response assessment has recently been addressed by a few studies, the use of WB-MRI for response evaluation and restaging requires further investigation [17,30].
To conclude, inter-observer agreement of WB-MRI DWI was good for both nodal and extra-nodal staging and for determining disease stage. The addition of DWI to the WB-MRI protocol for staging paediatric HL improved staging agreement with the FDG-PET/CT-based reference standard. Concordance between intrinsic WB-MRI DWI and the FDG-PET/CT-based reference standard was excellent but did not reach 100% because of discrepancies in the assessment of splenic involvement.
Funding This project was financially supported by Stichting Kinderen Kankervrij (KiKa, project number 87). The collection, analysis and interpretation of data, the writing of the paper and the decision to submit were not influenced by KiKa.

Compliance with ethical standards
Guarantor The scientific guarantor of this publication is R.A.J. Nievelstein.

Conflict of interest
The authors of this manuscript declare no relationships with any companies whose products or services may be related to the subject matter of the article.

Statistics and biometry
No complex statistical methods were necessary for this paper.
Informed consent Written informed consent was obtained for all included children and/or their parents or legal guardians.
Ethical approval Institutional Review Board approval was obtained.

Methodology
• prospective
• cross-sectional study/diagnostic study
• multicentre study

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.