Introduction

The incidence of renal cell carcinoma up to age 75 is 4.4/100 000 worldwide (2.4% of new cancer cases, 1.7% of cancer deaths) (Ferlay et al. 2015) with higher incidence rates and declining mortality in developed countries (Znaor et al. 2015). For the survival of patients, it is crucial that radiological treatment monitoring provides substantial information for therapeutic decision making. However, a recent study suggests that the value of radiologic reports considerably depends on the methodical approach of assessment (Goebel et al. 2017).

Whilst in routine clinical practice, quantitative assessment criteria for morphologic CT image interpretation are rarely standardized and the method of free text reporting (FTR) is common, in clinical studies, consideration of standardized response evaluation criteria in solid tumors (RECIST) is required. With FTR, in general, no specific criteria for evaluation are defined. Radiologists routinely refer to the most previous finding. In contrast, RECIST based reporting refers either to baseline or to nadir. Moreover, with RECIST, selection of target lesions is specified. RECIST were updated in 2009 (validated RECIST 1.1 guideline) and are mainly to be applied in case of cytotoxic chemotherapy (Eisenhauer et al. 2009) whereas immune-RECIST (iRECIST consensus guideline) were proposed in 2017 and are supposed to be applied in patients who receive immunotherapy (Seymour et al. 2017). With immunotherapy, even initial progression due to infiltration of various immune cells into the tumor with subsequent reduction or stabilization of the tumor size (pseudoprogression) is associated with prolonged survival (Ma et al. 2019; Aykan et al. 2020). Application of iRECIST for the assessment of metastatic renal cell carcinoma (mRCC) associated tumor burden, supported by semi-automated comparison using specified references (i.e.: baseline or nadir) may improve radiological assessment and reporting quality even in all day clinical practice.

The purpose of our study was to evaluate whether assessment of disease response in patients with mRCC according to common practice FTR without specified evaluation criteria sufficiently agrees with that based on software-aided application of iRECIST.

Methods

Study population

This retrospective, single-center study was approved by the institutional review board.

We searched the institutional medical database and included 50 consecutive patients with mRCC who had been treated with immunotherapy between January 2015 and October 2020. Patients with a measurable tumor burden at baseline according to current guidelines on tumor response criteria (RECIST 1.1 and iRECIST Eisenhauer et al. 2009; Seymour et al. 2017)), who had completed at least three follow-up examinations with contrast-enhanced CT of thorax, abdomen, and pelvis were eligible for inclusion.

CT data acquisition

CT scans were performed using Revolution CT (GE Healthcare) with a detector width of 160 mm. After intravenous administration of a 1 ml/kg body weight bolus of contrast agent (Accupaque 350 mg, GE Healthcare) followed by a 50 ml saline flush, the imaging started with a bolus-triggered technique (monitoring frequency from 10 s after contrast injection: 1 per second; trigger threshold: an increase of 100 HU in the descending aorta; delay from trigger to initiation of scan: 15 s). For CT scans, parameters were as follows: 120 kVp, automatically set mAs values, reconstructed to a slice thickness of 3 mm and a pitch of 53.

Image analysis

Based on morphological evaluation of CT scans, FTRs had been prepared as part of daily clinical practice by two radiologists in agreement. At least one of them had more than 5 years of experience concerning assessment of tumor burden. Overall, 58 radiologists had participated in preparation of the considered FTRs. FTRs covered both description of pathologies and clinical interpretation including indication of tumor development. Assessment criteria had not been specified. Tumor burden after a given treatment had usually been compared to the recent prior CT examination, however, reference CT was not mentioned explicitly. For study purposes, if not already done by the examiner, a resident physician of the radiology department retrospectively assigned interpretation of FTRs to one of the following four disease response categories: complete response (CR), partial response (PR), stable disease (SD), or progressive disease (PD).

In parallel, we imported the same image datasets into a commercially available semi-automatic software (mint lesion version 3.7.3, MINT Medical GmbH) (Goebel et al. 2017). A radiologist with 13 years of experience, blinded to FTRs, retrospectively selected target lesions for baseline entries. Classification of target lesions was based on specified criteria including size, number per organ system, and reproducible measurability according to RECIST 1.1 guidelines. Lesions which did not meet these criteria were classified as non-target lesions (Eisenhauer et al. 2009). Subsequently, the radiologist manually measured the longest/shortest axis of each lesion with the aid of the software. The software automatically summed up the longest diameter of non-nodal target lesions and the short axis diameter of nodal target lesions. Lesions from follow-up CT scans were detected with support of the software and measured manually. The software again calculated the sum of lesion axis diameters at every follow-up to automatically compare it with the appropriate reference and assigned the respective disease category. According to iRECIST (Seymour et al. 2017), baseline CT serves as reference to determine response to treatment (iCR or iPR) and stable disease (iSD). Nadir (smallest sum of diameters so far) serves as reference to determine progression. There are two definitions of progress: unconfirmed progressive disease (iUPD) and confirmed progressive disease (iCPD: triggered by further progress after iUPD) (Online Resource: ESM Table 1). In addition, image analysis based on RECIST 1.1 was conducted to estimate to what extent pseudoprogression contributed to different assessment of tumor development.

Study outcome

In this study, unstandardized FTR on disease response was compared to software-supported assessment using iRECIST. The latter had been chosen as comparator because, in contrast to RECIST 1.1, iRECIST take effects of immunotherapy into account that may mimic tumor progression (pseudoprogression) (Seymour et al. 2017; Ma et al. 2019). Outcome of primary interest was strength of agreement regarding change in tumor burden according to FTR and iRECIST, quantified as weighted kappa. Secondary outcomes were proportionate agreement between the two approaches and associations of selected variables with different rating among FTR and iRECIST-based reports.

Statistical analysis

Categorical variables are presented as counts and percentages and continuous variables as means and standard deviations. Agreement was assessed with Cohen’s kappa statistics. Amount of difference in rating was considered using weighted kappa. Strength of agreement was interpreted according to Landis and Koch (1977), (kappa ≤ 0.00: poor; 0.01–0.20: slight; 0.21–0.40: fair; 0.41–0.60: moderate; 0.61–0.80: substantial; 0.81–1.0: almost perfect agreement). Mann–Kendall test was used to determine whether agreement in tumor assessment had a trend over the series of follow-up examinations. Univariable analysis using logistic regression was applied to assess associations between selected variables with different rating of tumor burden with FTR or iRECIST. The p value threshold was adjusted for multiple testing (p < 0.006). Analysis was performed using StatsDirect statistical software version 2.8.0. (StatsDirect Ltd.) and XLSTAT version 2015.6.01.24026 (Addinsoft).

Results

Study population

A total of 50 patients (64 ± 10 years, 68.0% male sex) with mRCC were included in the study. Prognostic risk according to international metastatic renal cell carcinoma database consortium (IMDC) assessment was poor in 20% of patients. Average number of target lesions per patient was 2.4 ± 1.4. The most common sites of target and non-target lesions were lung (33.0%), lymph nodes (30.0%), kidney (7.1%), and liver (6.6%) (Table 1). Patients completed 8.4 ± 2.3 CT follow-up evaluations over a period of 22.8 ± 7.9 months. Intervals between follow-ups were 2.7 ± 1.8 months (Online Resource: ESM Table 2).

Table 1 Patient and disease characteristics

Free text reporting versus iRECIST-based reporting

Rating of change in tumor burden according to either FTR or iRECIST differed in 30–57% of patients throughout follow-ups. In 8–32% of patients, response category was assessed more favorable, and in 14–26% less favorable with FTR. Ratings differed by one (24–52% of patients) or two (0–6% of patients) categories (Fig. 1a).

Fig. 1
figure 1

Difference in assessment of response categories between FTR and iRECIST. Proportion of different assessment (a) and strength of agreement (b) regarding response category in patients with metastatic renal cell carcinoma with either FTR or iRECIST. Strength of agreement is presented as weighted kappa with 95% confidence interval. Regarding iRECIST, unconfirmed progressive disease (iUPD) was put on the same level with confirmed progressive disease (iCPD). FTR, free text reporting; iRECIST, response evaluation criteria in solid tumors for immune-based therapeutics

Strength of agreement between FTR and iRECIST-based reports was fair to substantial throughout CT follow-up examinations. Weighted kappa ranged from 0.38 (95% CI 0.2–0.6) to 0.70 (95% CI 0.5–0.9). There was no trend in the series of weighted kappa over subsequent follow-ups (Kendall’s Tau -0.05, p = 0.93), (Fig. 1b). Weighted kappa increased when iRECIST was experimentally run without considering nadir or baseline but instead only referring to the respective preceding examination, analogous to the approach of FTR (0.47 [95% CI 0.29–0.65] to 0.92 [95% CI 0.71–1.12]), (Online Resource: ESM Fig. 1).

With FTR, disease was preferably reported as SD at all follow-ups. Assessment of PR was 8–22 and PD was 10–32 percentage points less with FTR compared to iRECIST-based reports from the 3rd follow-up on. In contrast, with iRECIST, proportion of SD ratings declined from 76% at the 1st to 18% at the 10th follow-up. Overall, FTR on response or progression was not confirmed by iRECIST-based reports in 22% (21 of 96 CT evaluations, 19 patients) or 16% (12 of 76 CT evaluations, 11 patients), respectively (Fig. 2). Two of the patients in whom FTR indicated PD but iRECIST stated SD (2/11) underwent immediate change of treatment. However, change was due to diarrhea in one of them.

Fig. 2
figure 2

Distribution of disease response categories according to approach of reporting. CR complete response, iCPD immune confirmed progressive disease, iCR immune complete response, iPR immune partial response; iSD, immune stable disease, iUPD immune unconfirmed progressive disease, PD progressive disease, PR partial response, SD stable disease

Pseudoprogression

Agreement between assessment of change in tumor burden using RECIST 1.1 or iRECIST was almost perfect (weighted Cohen’s kappa: 0.89 (95% CI 0.63–1.14) to 1.0 (95% CI 0.81–1.19), Online Resource: ESM Fig. 2. Different assessment was caused by pseudoprogression in 5 patients (identified with iRECIST at 11 follow-ups) representing 10% of the total study cohort. Pseudoprogression was observed in lymph nodes (3 lesions), lung (3 lesions), and pleura (1 lesion). In 4 of 11 (36%) examinations, pseudoprogression was also covered by FTR. Examples for assessment of progression with iRECIST are provided in Fig. 3 and Online Resource: ESM Fig. 3.

Fig. 3
figure 3

Unconfirmed progression assessed with iRECIST in a 62-year-old male patient presenting with tumorous lymph nodes. At the 6th follow-up, target lesion response was rated as stable because short axis of the lymph node increased by ≥ 20% but by < 5 mm and thus did not meet criteria for progression. New lesion assessment was rated as iUPD because measure of short axis was still > 10 mm but did not increase by ≥ 5 mm from the initial detection. Thus, progressive disease was not confirmed, and overall response status was calculated as iUPD. B baseline, iUPD immune unconfirmed progressive disease, LA long axis, N nadir, P previous follow-up, SA short axis

Predictors of different assessment

Univariable analysis revealed that first occurrence of new lesions decreased the odds of different assessment according to FTR compared to iRECIST (odds ratio [OR] 0.02 (95% CI 0.001–0.4). In contrast, the presence of already existing new lesions raised the odds that disease response was assessed as better according to FTR than to iRECIST by more than fivefold (OR 5.4 (95% CI 2.9–10.1). Both an increase in the sum of target lesion diameters and in the number of target lesions decreased the odds of disease response of being assessed as lower according to FTR compared to iRECIST (OR 0.9 per 10 mm [95% CI: 0.8–1.0] and OR 0.8 per lesion [95% CI 0.6–0.9], respectively). Involvement of lymph nodes nearly doubled the odds of different assessment (OR 1.77 (95% CI 1.1–2.7). Every month from baseline increased the odds that disease response was assessed as better according to FTR than to iRECIST (OR 1.0 per 30 days from baseline [95% CI 1.0–1.1]) (Fig. 4).

Fig. 4
figure 4

Association of selected variables with assessment of disease response category depending on the approach of reporting. Based on univariable logistic regression aFor lymph nodes, actual short axis measurement was included in the sum diameter

Discussion

We conducted a retrospective comparison of FTR with reporting based on software-supported application of iRECIST to assess change in tumor burden on grounds of anatomically unidimensional CT scan measurement in patients with mRCC. Agreement between both approaches was only moderate. According to unstandardized FTR, new lesions which were already present in recent prior follow-ups were frequently not recognized as such. Different assessment following FTR compared to iRECIST was more frequently seen if lymph nodes were target lesions.

In daily clinical practice, evaluation of tumor burden in addition to symptomatic criteria represents a crucial parameter for therapy control. Although, the iRECIST guideline basically does not apply for clinical decision making (Seymour et al. 2017) and criteria still need to be validated, standardized response criteria could facilitate objectivity of assessment even in daily practice.

A previous trial on tumors of various origins (Goebel et al. 2017) reported on fair to moderate agreement between FTR and RECIST 1.1 based reports. In most cases, different reporting with FTR was attributed to assignment of even minor changes in tumor burden to PR or PD instead of SD. Another reason for discrepancies were comparisons to the most recent prior follow-up examinations instead to baseline or nadir. The latter was confirmed in our study. In addition, with FTR, new lesions that were already present were frequently not recognized as still existing new lesions, with the result of a too favorable rating. In case of lymph nodes, discrepancies might be based on lesion selection (iRECIST: a maximum of two nodes in total, even from different nodal basins) and measurement (iRECIST: at least 15 mm in short axis). Additionally, we found that the odds of too worse rating with FTR was reduced with increasing number and sum of diameters of target lesions. However, it seems conclusive that in advanced disease, worse rating is more frequently appropriate.

Assessment of pseudoprogression with iRECIST can only be determined with certainty after 4 weeks from the first detection of iUPD. After 4–8 weeks, disease could further progress and trigger iCPD or could stay progredient compared to nadir (iUPD) or show stable disease (iSD), partial response (iPR), or complete response (iCR) compared to baseline. This new assessment should prevent patients to be withdrawn from immunotherapy in case of pseudoprogression. An earlier study found that continued immunotherapy beyond iUPD can prolong survival (George et al. 2016). Such a review of assessment is not automatically provided with FTR, however, could be visualized using longitudinal analysis and graphical methods (Shen et al. 2014). In our study, FTR investigators identified pseudoprogression in some of the patients with increased lymph node axis despite of decreased sum of parenchymal lesion diameters. Frequency of pseudoprogression, determined with iRECIST was similar to previous findings in patients with mRCC (9–14%) (Queirolo et al. 2017).

For unexperienced readers, even RECIST contains pitfalls that may origin in baseline lesion selection, reassessment of lesions, and identification of new lesions (Abramson et al. 2015; Keil et al. 2014) identified the choice of target lesions as major source of disagreement between readers, even with consideration of RECIST 1.1. Thus, target lesion selection probably remains a subjective confounder in the assessment of tumor development. Target lesions should be unequivocally metastases, but not postoperative seroma or granulation tissue. Lesions within a radiation field should not be selected as target lesions unless progression is demonstrated. At follow-up, lesions should be remeasured in the same phase of contrast. Finally, even in case of axis shift, it is required to remeasure the true long (parenchymal lesions) or short axis (nodal lesions), respectively (Abramson et al. 2015).

RECIST/iRECIST are well described and known in clinical practice, however, rarely implemented. This may be due to need for assignment of the appropriate reference CT which requires request for baseline information including treatment modalities from oncologists. Moreover, reproducibility is supposed to increase when reference and follow-up CT scan is read by the same radiologist (Muenzel et al. 2012; Olthof et al. 2018). According to Krajewski et al. intra-rater variability in CT tumor size measurement ensures reproducibility of 10% tumor shrinkage measured by a single radiologist (Krajewski et al. 2014). However, this approach might be difficult to apply in small radiology departments. Thus, software-aided application of RECIST/iRECIST, however, can be a sufficient and time-saving support to increase reproducibility of the reporting. From this one might conclude that the large number of radiologists who prepared FTRs might have contributed to the decreased agreement with iRECIST-based reports conducted by a single radiologist.

This study has some limitations. First, assessment criteria applied for FTR were not reported. In addition, assignment of FTR interpretation to disease categories for study purposes was not conducted by oncologists, the actual receivers, but by radiologists. Anyway, assignment left scope for interpretation. Furthermore, only a single reader conducted assessment according to iRECIST and we did neither systematically consider treatment decisions nor clinical outcomes. However, as observed in patients in whom FTR indicated progressive- but iRECIST found stable disease, treatment decision, was not necessarily associated with CT assessment. Finally, our small-scale trial was conducted retrospectively at a single center and thus, should be considered as exploratory.

Conclusions

In conclusion, utility of radiological assessment for treatment stratification increases with objectivity and traceability of reports. Standardized assessment criteria constitute a sound basis for quantitative morphologic evaluation of disease development in patients with mRCC. Additional text-only qualitative reports may be better suited to be seen by the patient (Travis et al. 2014). Due to unsatisfactory agreement between FTR and iRECIST-based reporting, we recommend extending application of iRECIST from clinical studies to routine clinical practice. Dedicated software may support implementation.