Introduction

Diffuse large B-cell lymphoma (DLBCL) is the most common subtype of adult non-Hodgkin lymphoma (NHL) and is associated with an aggressive clinical course. Several potentially effective first-line chemotherapy regimens exist, most of which consist of cyclophosphamide, doxorubicin, vincristine, and prednisone (CHOP). The addition of the monoclonal antibody rituximab (R) to this regimen (R-CHOP) has significantly improved the outcome of DLBCL patients [1, 2]. However, treatment failure remains an important problem, as the 3-year progression-free survival (PFS) of DLBCL patients is approximately 60–70% [3].

Commonly used prognostic indices are the International Prognostic Index (IPI) [4, 5], or the more powerful Revised-IPI (R-IPI) [6], and National Comprehensive Cancer Network IPI (NCCN-IPI) [7]. These indices can be used for risk-stratification to predict a poor outcome after R-CHOP. It is important to identify a poor outcome as soon as possible because these patients could benefit from a switch to a second-line treatment or high-dose chemotherapy (HDCT) with autologous stem cell transplantation (ASCT) as an upfront treatment [8]. 18F-fluoro-2-deoxy-D-glucose (18F-FDG) positron emission tomography (PET) after a few cycles of therapy, also known as interim 18F-FDG PET, is of increasing interest, as it may facilitate early change of treatment and prevent unnecessary side effects [9]. In recent decades several visual criteria for interpretation of 18F-FDG PET have been developed, for example, the EORTC, PERCIST, and International Harmonization Project (IHP) criteria as well as the Deauville scoring system [9,10,11,12,13]. Nowadays the latter is widely adopted for interpretation of response evaluation with 18F-FDG PET in DLBCL [9, 13].

Interim 18F-FDG PET has shown high predictive value in Hodgkin lymphoma [14]; however, according to previous reviews, the role of interim 18F-FDG PET in DLBCL is still unknown [15,16,17,18]. From these studies it can be concluded that heterogeneity in patient populations, therapy regimens, PET scanners, timing of the interim 18F-FDG PET scans, and/or differences in the visual criteria used for interpretation of the interim 18F-FDG PET scans made it hard to clarify the accuracy of interim 18F-FDG PET to predict clinical outcome in DLBCL.

Therefore, we performed a new systematic review and meta-analysis, focusing on DLBCL patients only, assessing both the hazard ratio (HR) and diagnostic parameters (sensitivity, specificity, and predictive values) of interim 18F-FDG PET for PFS or event-free survival (EFS) in patients with DLBCL treated with first-line immuno-chemotherapy regimens. The primary outcome measure was PFS (preferably) or EFS at 2 years, since DLBCL patients who are event-free after 24 months have demonstrated an overall survival (OS) comparable to an age- and sex-matched general population [19]. To reduce the previously described heterogeneity we performed several subgroup analyses, for example, by the type of 18F-FDG PET scanner and the type of visual criteria used for interpretation of the interim 18F-FDG PET scans. Where necessary, we contacted the study authors for additional information.

Materials and methods

Search strategy

For this systematic review and meta-analysis we searched, in collaboration with a medical librarian, the PubMed/MEDLINE, Embase, and Cochrane Library databases from inception until July 11, 2017, with a language restriction to English, French, Dutch, or German. Our search strategy contained a combination of various indexed terms and free text words for “positron emission tomography” and “non-Hodgkin lymphoma” (full search strategy in Supplemental Table 1). We included full-text publications of original prospective and retrospective studies. Excluded were conference abstracts, letters, comments, editorials, review articles, animal studies, and case reports. Reference lists of included articles were checked to identify additional eligible studies.

Study selection: Eligibility criteria

Patients

Adult patients treated with first-line immuno-chemotherapy regimens for stage I-IV DLBCL were considered as our target population. We excluded studies that investigated HIV-related lymphoma, central nervous system (CNS) lymphoma involvement, or post-transplant lymphoproliferative disease (PTLD). Studies containing less than 80% of DLBCL subtype were excluded, unless subgroup data for DLBCL were presented or if the remaining 20% had primary mediastinal B-cell lymphoma (PMBCL) or follicular lymphoma (FL) grade 3B [20]. Studies including ten or fewer patients were classified as case series and therefore also excluded.

Treatment procedures

Studies in which a change of treatment was based on the interim 18F-FDG PET result and prospective PET-adapted trials were not included. However, we allowed a change of therapy in patients with clinical evidence of progressive disease during first-line treatment [9].

We included all R-CHOP-like treatments as first-line treatment strategies [1, 2, 21,22,23], but we excluded studies if ≤50% of patients received rituximab. Therapies using other (new generation) monoclonal antibodies were excluded.

Studies with autologous stem cell transplantation (ASCT) were eligible if this strategy was part of the preplanned first-line treatment. Radiotherapy was accepted if the decision to give radiotherapy was preplanned or used for consolidation of PET positive sites at the end of first-line treatment, but not affected by interim 18F-FDG PET results. If studies did not report on the use of ASCT or radiotherapy, we assumed that no ASCT or radiotherapy was given based on interim 18F-FDG PET result.

Interim 18F-FDG PET procedures

An interim 18F-FDG PET scan should have been performed after the first, second, third, or fourth treatment cycle. PET-only as well as PET/CT systems were considered eligible. Use of radiopharmaceuticals other than 18F-FDG was not accepted.

We focused on visual interpretation criteria only, as nowadays, semi-quantitative PET strategies are used for research purposes only and are not standard in the current guidelines yet [13]. PET response criteria were grouped into three categories: Deauville score (DS) on a 5-point scale [9, 13], International Harmonization Project (IHP) [12], and custom visual criteria (i.e. not based on consensus guidelines).

Outcome measures

The primary outcome measure was defined as PFS (preferably) or EFS at 2 years. We included studies with a minimum median follow-up period of 24 months in surviving patients (or for the entire study population), because most patients experience relapse or progression of their disease in the first 2 years after their diagnosis [24, 25].

Data extraction and quality assessment

After removing duplicates, two authors independently screened titles and abstracts of the search results for eligibility (CNB and NH, AdJ, or HCWdV). The decision to include studies in the review was based on the full-text articles (CNB and AdJ or HCWdV). Extensive data extraction forms (available upon request) were developed which included the criteria from the methodological checklists for diagnostic accuracy studies (QUADAS-2) [26] and for prognostic studies (QUIPS) [27]. The forms were tested in a few articles and used independently by two review authors (CNB, AdJ). Consensus meetings (with three experts in nuclear medicine, hematology, and methodology, respectively) were organized to solve disagreements and to decide on eligibility of the final study selection. Besides general information about study design, patients, treatment, interim 18F-FDG PET performance, and outcome measures (used for qualitative study descriptions and determination of eligibility) we extracted outcomes on two types of predictive parameters.

For the first predictive meta-analysis we extracted univariate hazard ratios (HRs) and their corresponding 95% confidence intervals. If these data were not reported and not provided after contacting the authors, we used the methods of Tierney et al. [28] to deduce them from reported parameters or from the Kaplan-Meier (KM) curves, using numbers at risk when available.
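As an illustration of how a log HR and its standard error can be recovered from summary statistics, the following minimal Python sketch covers two indirect routes: from a reported HR with its confidence interval, and from a two-sided log-rank P value with the number of events and group sizes. The function names are hypothetical, and the second route uses the usual log-rank variance approximation as we understand it; it is not a reproduction of the calculations in [28] and should be checked against that reference before use.

```python
import math
from statistics import NormalDist


def log_hr_from_ci(hr, lower, upper, level=0.95):
    """Log HR and its SE from a reported HR and confidence interval.

    Assumes the interval is symmetric on the log scale (Wald-type CI).
    """
    z = NormalDist().inv_cdf(0.5 + level / 2.0)  # about 1.96 for a 95% CI
    se = (math.log(upper) - math.log(lower)) / (2.0 * z)
    return math.log(hr), se


def log_hr_from_p_and_events(p_two_sided, total_events, n_pos, n_neg,
                             worse_in_positive=True):
    """Approximate log HR and SE from a log-rank P value and event count.

    Assumption: the variance of (O - E) is approximated by
    total_events * n_pos * n_neg / (n_pos + n_neg)**2.
    The direction of the effect cannot be read from the P value and
    must be supplied by the caller.
    """
    v = total_events * n_pos * n_neg / float((n_pos + n_neg) ** 2)
    z = NormalDist().inv_cdf(1.0 - p_two_sided / 2.0)
    o_minus_e = math.sqrt(v) * z
    if not worse_in_positive:
        o_minus_e = -o_minus_e
    return o_minus_e / v, 1.0 / math.sqrt(v)


# Example with the pooled estimate reported in this review (HR 3.13, 95% CI 2.52-3.89)
log_hr, se = log_hr_from_ci(3.13, 2.52, 3.89)
```

The pair (log HR, SE) is the input required for inverse-variance pooling; extraction from KM curves needs additional steps that are not sketched here.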

For the second predictive meta-analysis we used a diagnostic approach and constructed 2 × 2 contingency tables to calculate sensitivity, specificity, positive predictive value (PPV), and negative predictive value (NPV) of interim 18F-FDG PET for prediction of two-year PFS and EFS. If no two-year survival percentages were reported we estimated the percentages from the KM curves at this time-point. If information was missing or unclear, authors were contacted. A maximum of three reminders was sent. In case of no reply we used the information that was available from the original publication. Individual patient data were not requested for this meta-analysis.
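The diagnostic measures derived from a 2 × 2 table can be sketched as follows. In this minimal Python example a "positive" test is a positive interim PET scan and an "event" is progression (or another event) within 2 years. The helper that builds the table from group sizes and two-year survival fractions is a hypothetical simplification (it rounds to whole patients and ignores censoring), and all counts are invented for illustration only.

```python
def table_from_two_year_survival(n_pos, surv_pos, n_neg, surv_neg):
    """Build a 2 x 2 table from group sizes and 2-year survival fractions.

    Simplifying assumptions: event counts are rounded to whole patients
    and censoring before 2 years is ignored.
    Returns (tp, fp, fn, tn).
    """
    tp = round(n_pos * (1.0 - surv_pos))  # PET positive, event within 2 years
    fp = n_pos - tp                       # PET positive, event-free at 2 years
    fn = round(n_neg * (1.0 - surv_neg))  # PET negative, event within 2 years
    tn = n_neg - fn                       # PET negative, event-free at 2 years
    return tp, fp, fn, tn


def diagnostic_measures(tp, fp, fn, tn):
    """Sensitivity, specificity, PPV, and NPV from a 2 x 2 table."""
    return {
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "ppv": tp / (tp + fp),
        "npv": tn / (tn + fn),
    }


# Hypothetical study: 30 PET-positive patients (2-year PFS 50%),
# 70 PET-negative patients (2-year PFS 86%)
table = table_from_two_year_survival(30, 0.50, 70, 0.86)
measures = diagnostic_measures(*table)
```

Note that PPV and NPV depend on the event rate in the study population, whereas sensitivity and specificity depend on the positivity threshold, which is one reason these measures can vary widely between studies.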

Statistical analyses

Two approaches of meta-analysis

For the meta-analyses of the HRs, individual log HRs and their standard errors (SEs) were pooled using a random-effects model fitted by restricted maximum likelihood (REML). Together with the individual study results, the pooled effect estimate, expressed as HR with 95% confidence interval, was visualized in a Forest plot. Between-study heterogeneity was assessed using Cochran’s Q and I² statistics [29]. A 95% prediction interval around the HR was calculated to predict the expected range of the HR of a new (future) study [30]. A funnel plot was presented to visually assess whether publication bias was likely [31].
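The quantities named above can be illustrated with a minimal Python sketch. Note that it does not reproduce the REML fit used in this review: for brevity it substitutes the simpler, non-iterative DerSimonian-Laird estimator of the between-study variance, so its results will generally differ slightly from a REML analysis. The function names are hypothetical, and the critical value of the t distribution (k − 2 degrees of freedom for k studies) must be supplied by the caller because the Python standard library does not provide it.

```python
import math


def pool_log_hrs(log_hrs, ses):
    """Random-effects pooling of log HRs (DerSimonian-Laird, not REML)."""
    k = len(log_hrs)
    w = [1.0 / se ** 2 for se in ses]                 # inverse-variance weights
    sum_w = sum(w)
    fixed = sum(wi * yi for wi, yi in zip(w, log_hrs)) / sum_w
    # Cochran's Q: weighted squared deviations from the fixed-effect mean
    q = sum(wi * (yi - fixed) ** 2 for wi, yi in zip(w, log_hrs))
    df = k - 1
    c = sum_w - sum(wi ** 2 for wi in w) / sum_w
    tau2 = max(0.0, (q - df) / c) if c > 0 else 0.0   # between-study variance
    i2 = max(0.0, (q - df) / q) * 100.0 if q > 0 else 0.0
    # Random-effects weights add tau2 to each within-study variance
    w_re = [1.0 / (se ** 2 + tau2) for se in ses]
    mu = sum(wi * yi for wi, yi in zip(w_re, log_hrs)) / sum(w_re)
    se_mu = math.sqrt(1.0 / sum(w_re))
    z = 1.959964                                      # normal quantile, 95% CI
    return {
        "hr": math.exp(mu),
        "ci": (math.exp(mu - z * se_mu), math.exp(mu + z * se_mu)),
        "mu": mu, "se_mu": se_mu, "q": q, "tau2": tau2, "i2": i2,
    }


def prediction_interval(mu, se_mu, tau2, t_crit):
    """95% prediction interval for the HR of a new study.

    t_crit is the 97.5% quantile of the t distribution with k - 2
    degrees of freedom, supplied by the caller.
    """
    half = t_crit * math.sqrt(tau2 + se_mu ** 2)
    return math.exp(mu - half), math.exp(mu + half)
```

The prediction interval is wider than the confidence interval because it adds the between-study variance to the uncertainty of the pooled estimate; when the estimated between-study variance is zero, the two differ only through the t versus normal quantile.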

For the diagnostic meta-analysis, the pooled sensitivity and specificity were obtained from hierarchical summary ROC (HSROC) models, and ROC curves were constructed in RevMan [32] using the input parameters of the HSROC models.

Influence of covariates

Several prespecified subgroup analyses, covering both clinical and methodological issues, were performed using univariate meta-regression models for the HRs and as covariate interaction terms in the HSROC models. The following subgroup analyses were performed: study design (retrospective or prospective studies; blinded review or not reported; PFS or EFS), characteristics of patients (100% DLBCL or between 80 and 100%), treatments (ASCT upfront or not, preplanned or consolidative radiotherapy used or unknown), properties of scans (PET/CT or a combination of PET/CT and PET standalone systems, availability of a baseline PET or CT), and scoring issues (DS, IHP, or custom criteria; central review or local review).

Software

Statistical analysis was performed in R (version 3.2.5) [33] using the metafor package, and SAS PROC NLMIXED was used for the HSROC models. A P value of less than 0.05 was considered statistically significant.

Results

The search yielded 9,960 records after removing duplicates; 290 concerned studies on NHL and interim FDG-PET, and the other 9,670 records were excluded because they did not report on NHL or interim PET (I-PET). Of these 290 records, 85 were potentially eligible and their full-text articles were retrieved. After checking detailed inclusion and exclusion criteria we included 20 eligible studies in the qualitative systematic review; 18 out of 20 were eligible for the HR evaluations and 19 out of 20 for the HSROC analyses (Fig. 1).

Fig. 1

PRISMA flow diagram. *Records refer to the title and abstract screening of the search results. Full-text articles refer to the full-text assessment of the selected articles from the title and abstract screening phase. Abbreviations: I-PET = interim 18F-FDG positron emission tomography, FLT = Fluorothymidine, DLBCL = diffuse large B-cell lymphoma, EoT-PET = end-of-treatment 18F-FDG positron emission tomography, HR = hazard ratio, HSROC = hierarchical summary receiver operating curve

A total of 2,411 newly diagnosed DLBCL patients from 20 studies were assessed for this analysis. Table 1 shows the main study, patient, and treatment characteristics of the included studies. The number of included patients per study ranged from 32 to 327 (median 112, interquartile range 70–142). Seven studies had a prospective study design. The median age of the patients ranged from 54 to 65 years, with the exception of one study with a median age of 46 [40], and 45–67% of the patients were male. Most studies included patients with Ann Arbor stage I/II as well as stage III/IV; in two studies less than 50% of the patients had stage III or IV [37, 45] and one study included patients with stage III and IV only [51]. First-line treatment regimens varied between and within the studies, but R-CHOP was the backbone in all studies. Radiotherapy was given in most of the studies to selected patients (preplanned, e.g. in case of bulky disease or as consolidation for residual lymphoma sites after treatment). Autologous stem cell transplantation had been planned upfront in three studies [44, 48, 50].

Table 1 Study- and patient characteristics

In Table 2 details of PET procedures, interpretation, and timing of interim PET between cycles are shown. Most studies performed an interim PET scan after two cycles of chemotherapy in all patients; one study performed interim PET scans after only one course in all patients [43], and the remaining studies combined patient groups who had their interim assessment after a variable number of treatment cycles. The number of days between the previous treatment course and the interim PET acquisition also varied between studies; the scan was mostly performed just before the next chemotherapy cycle, but this interval was not reported by all studies. Twelve studies applied the Deauville scoring system and four the International Harmonization Project system [40, 46, 48, 51]. The remaining studies used a custom scoring system [42, 50, 52, 53].

Table 2 Interim 18F-FDG PET characteristics

The outcome measures of the included studies are shown in Table 3: 16 studies presented PFS and the other four studies reported EFS. The definitions of PFS and EFS for the different studies are presented in Supplemental Table 2. Percentages of positive interim PET scans ranged from 18.1 to 56.3%. Five original publications had reported univariate HRs, and four authors provided a (re)calculated HR upon our request. Two authors provided information about the number of events and P-values in order to use the method from Tierney et al. [28]. For one study we extracted the HR from the KM curves with numbers at risk provided by the authors and for six studies we used the KM curves without numbers at risk. For two studies we could not extract the HRs, as there was insufficient data and no Kaplan-Meier curve [36, 48].

Table 3 Study results; prognostic and diagnostic information

In Fig. 2 the Forest plot with the 18 univariate HRs is shown. The pooled HR was 3.13 (95% CI 2.52–3.89). The Cochran’s Q test for heterogeneity was not statistically significant (P = 0.087) and between-study heterogeneity was low (I² = 35.14%). The 95% prediction interval was 1.68–5.83, with one outlier [37].

Fig. 2

Forest plot of univariate hazard ratios for interim PET scans in diffuse large B-cell lymphoma. This plot shows the univariate hazard ratios (black squares, size based on study size) and 95% CIs (horizontal lines) of the individual studies, sorted by publication year, for PFS/EFS of the interim PET positive and negative patients. The pooled effect estimate is shown with a diamond. For each study a 2 × 2 contingency table at 2 years follow-up is shown

The methodological quality was assessed based on the QUADAS-2 and QUIPS checklists. Subgroup analyses were performed on study design characteristics that were potential sources of bias.

Meta-regression showed that the outcomes did not differ between retrospective and prospective studies, studies with blinded review and studies that did not report whether they blinded the PET/CT assessment, or studies that used PFS or EFS as outcome measure. A statistically significant higher HR was found for studies with a combination of integrated PET/CT- and PET standalone systems compared to studies with integrated PET/CT systems only (HR 4.39 vs 2.85, P = 0.0332) and a trend towards a higher HR in studies with 80–99% DLBCL compared to studies with 100% DLBCL (P = 0.0577). Prespecified subgroups for different types of treatments and FDG-PET scoring systems showed no statistically significant differences (Supplemental Table 3). For the subgroups “availability of baseline PET or CT” and “central or local review procedure”, insufficient information was reported to perform these analyses. Risk of publication bias as assessed with a Funnel plot was low (Supplemental Fig. 1).

Nineteen studies had data available for the calculation of PPV, NPV, sensitivity, and specificity of interim PET for prediction of two-year-PFS or -EFS. For one study we could not extract or calculate the diagnostic measures [48]. PPV and NPV ranged from 20 to 74% and 64 to 95%, respectively. Sensitivity and specificity ranged from 33 to 87% and 49 to 94%, respectively (Table 3, Supplemental Fig. 2).

In Fig. 3 the ROC curves of the different visual criteria are shown. The studies that were classified as “custom” did not have comparable scan positivity definitions and therefore no summary curve for this group was presented. We found no statistically significant differences between the curves for Deauville and IHP. There was a trend (P = 0.0503) towards a higher accuracy for studies with 80–99% DLBCL patients versus studies with 100% DLBCL patients.

Fig. 3

Summary receiver operating curves (sROC) for different visual interim PET criteria. Studies that assessed interim PET scans according to the Deauville criteria are indicated with blue circles, studies that used the International Harmonization Project (IHP) criteria are indicated with red diamonds, and studies that used custom visual criteria are indicated with green squares. The size of the circles, diamonds, and squares is based on the inverse standard error

Discussion

This systematic review and meta-analysis included 20 studies comprising a total of 2,411 DLBCL patients who underwent interim 18F-FDG PET. Eighteen studies were eligible for the HR and 19 for the HSROC meta-analyses. We found a pooled estimated HR of 3.13 (95% CI 2.52–3.89) for interim PET in the prediction of PFS or EFS. The prediction interval ranged from 1.68 to 5.83, suggesting that a new study investigating the prognostic value of interim PET for PFS or EFS will find an HR in this range with 95% confidence. These results confirm the predictive value of interim PET in DLBCL patients for PFS and EFS. Our pooled estimated HR was lower than that of a previous meta-analysis (2013) [16], which reported a pooled estimated HR of 4.4 (95% CI 3.34–5.81) from nine studies investigating the prediction of PFS by interim PET. That meta-analysis used a similar approach to extract HRs; however, it had less strict inclusion criteria with regard to the NHL types and follow-up period, included both visually and semi-quantitatively assessed PET scans, and performed no subgroup analyses. Despite these differences, its HR is within the range of our calculated 95% prediction interval and the amount of statistical heterogeneity (I² = 39%) amongst studies was comparable. Other meta-analyses did not compare the HRs between studies [15, 17, 18].

We have no explanation for the statistically significant higher HR for studies (n = 5) that used both PET/CT- and PET standalone systems compared to studies that used an integrated PET/CT system.

The trend towards a higher HR for the studies with both DLBCL and PMBCL patients compared to studies with only DLBCL patients could not directly be explained by the inclusion of both lymphoma subtypes. The fact that two out of three studies with both DLBCL and PMBCL patients [52, 53] used custom criteria for the interpretation of the interim PET could possibly explain this. These meta-regression results should be interpreted with caution, as the number of studies per subgroup was relatively low (Supplemental Table 3), which precludes multivariate meta-regression analysis.

Diagnostic 2 × 2 contingency tables of interim PET showed wide ranges between studies for sensitivity, specificity, and positive predictive values at 2 years. The ranges reported in other systematic reviews and meta-analyses were hard to compare as they used the complete follow-up period for their calculations, included studies with follow-up periods less than 24 months, and used other statistical methods [15, 17, 18]. We decided to truncate at 2 years, as most clinically relevant events occur during this period. Moreover, the widely ranging complete follow-up periods of individual studies might introduce bias.

Negative predictive values for 2-year progression-free status were generally above 80%, except in four studies [34, 35, 39, 53]. In Mamot et al. [39], the somewhat lower negative predictive value could possibly be explained because radiotherapy (administered regardless of PET results) was counted as an event and resulted in a lower EFS rate compared to other clinical trials. Zhao et al. [53] had a low percentage of negative interim PET scans and a high number of events, which explains the lower NPV.

The higher sensitivity values seen in ROC analysis for both IHP and custom criteria vs. the Deauville system may be explained by the lower threshold of test positivity with IHP vs. Deauville (using mediastinal blood pool and liver activity as the reference tissue, respectively). None of the studies using custom criteria defined a threshold comparable to or higher than hepatic uptake. We found widely ranging positivity rates between studies, which are mainly in agreement with the timing of interim PET between cycles and the criteria used. In an exploratory analysis of five studies [34, 37,38,39, 47] that performed interim PET strictly after 2 cycles of therapy and applied the Deauville scoring system we found a pooled estimated HR of 3.48 (95% CI 2.46–4.93) with a corresponding 95% prediction interval of 1.58–7.67 (Supplemental Fig. 3). The positivity rates for these studies ranged between 18 and 46%, PPV from 37 to 74%, and NPV from 76 to 91%, comparable to the analysis including all studies.

We chose to present the methodological characteristics alongside the other characteristics of the study population and treatments (Table 1) and alongside the characteristics (including timing between cycles) of the index test (Table 2).

QUADAS-2 and QUIPS criteria were applied to assess the quality of the studies from the perspective of risk of bias and applicability. In this review, the strict inclusion and exclusion criteria with regard to patient population (at least 80% DLBCL), index test (interim PET after one to four treatment cycles), and reference standard (PFS and EFS) guaranteed the applicability of the results to the review question. In the subgroup analyses we examined whether bias could have occurred because of methodological shortcomings. It appeared that none of these affected the results. Only characteristics of the population (< 100% DLBCL) and a combination of integrated and standalone systems seemed to have an impact on the predictive value of interim PET.

We used a comprehensive search strategy and applied strict inclusion and exclusion criteria. We focused on DLBCL patients and on 2-year PFS. Moreover, we examined the influence of different design characteristics (retrospective and prospective, blinded review or not reported; PFS or EFS), characteristics of patients (100% DLBCL or between 80 and 100%), treatments (ASCT upfront or not, preplanned or consolidative radiotherapy used or unknown), availability of a baseline PET or CT, properties of scans (PET/CT or a combination of PET/CT and PET standalone systems), and scoring issues (DS, IHP, or custom criteria; central review or local review). Only the patient characteristics and properties of scans affected the results. It appeared that the HR estimates of the included studies were quite homogeneous (I² = 35%).

By contacting the authors we were able to include most of the eligible studies in our meta-analysis and to deduce data that were not presented directly by the authors. Some data, though, were hard to obtain from the studies.

First of all, the definition of the start of progression-free survival and event-free survival differed amongst studies. Some studies started their follow-up period at the time of diagnosis and others at initiation of first-line treatment. Recent data have shown that patients who have a more aggressive disease tend to be treated earlier, so there could be selection bias between studies that have a shorter period between time of diagnosis and initiation of treatment versus studies with a longer period [54]. For future studies it seems important to have a comparable start of the follow-up period, and authors should report the interval between diagnosis and start of treatment to prevent or adjust for this risk of bias.

Another issue is that timing of the interim PET scans between cycles was different between studies; not only did the timing after which cycle the scan is performed differ, but also the number of days between the previous treatment course and interim PET. Unfortunately, not all authors report on this, although it is recommended to perform the scan at least 10 days after the previous course of chemotherapy, because of possible effects on tumor metabolism and systemic effects by, for example, growth factors [55].

In systematic reviews, investigators need to make choices. We chose to use the univariate data. This choice was made because univariate data were available in most studies and because of the large heterogeneity in factors for which the HR was adjusted in the primary articles. The adjusted factors were limited by the low number of events in most studies and partially based on available information such as quantitative PET analyses, immunohistochemistry and collection of specific clinical data (e.g. bone marrow involvement). Fourteen of the 20 studies performed a multivariate analysis. Most articles adjusted for the IPI score [34,35,36,37,38,39, 41,42,43, 46] or age-adjusted IPI (aaIPI) [44, 48, 49]; some dichotomized the score and others used the individual components. Results varied widely: in some studies both interim PET and (aa)IPI showed an independent association with PFS or EFS [42, 48], in others only interim PET [34, 37, 39, 41, 44, 53] or only (aa)IPI [43, 49] did, and in the remainder no independent association was found for either interim PET or (aa)IPI [35, 36, 38, 46]. One could argue that reporting univariate HRs instead of multivariate HRs could result in an overestimation of the predictive value of interim PET. Three studies reported both uni- and multivariate HRs, and the differences between univariate and multivariate HRs were −0.99 [41], 0.0 [39], and +0.2 [42], respectively.

We further decided to choose the DS threshold for the interim response criteria that is most commonly described (DS ≤ 3 versus DS ≥ 4), because presenting all thresholds would increase heterogeneity, influence effect sizes, and, finally, use the same patients' data multiple times in the analyses. Four studies presented multiple scores. Mylam et al. [43] published data about positivity for Deauville scores 4 and 5 as well as for Deauville score 5 and for IHP. Kim et al. [35] and Itti et al. [47] presented data about different positivity cutoff values for Deauville scores. Fuertes et al. [45] published a regular Deauville score as well as a 3-point scale. In this review, we focused on visual response assessment criteria; the potential added value of quantitative PET metrics is currently being investigated. Recently, a large phase III PET-adapted trial showed in a post-hoc analysis that a SUVmax reduction strategy [56] seems to discriminate better between good and poor outcome compared to the Deauville scoring system [57]. Finally, it should be mentioned that the studies from Safar et al. [50] and Itti et al. [47] had a small overlap in patient inclusion (n = 7); however, this will presumably not bias our results due to the small number.

Conclusion

This systematic review and meta-analysis shows that interim PET in DLBCL patients has predictive value (HR 3.13). However, some diagnostic test characteristics are still too low; in particular, the positive predictive value should be improved before a risk-stratified treatment approach can be implemented in clinical practice.