iRECIST-based versus non-standardized free text reporting of CT scans for monitoring metastatic renal cell carcinoma: a retrospective comparison

Purpose Therapy decision for patients with metastatic renal cell carcinoma (mRCC) is highly dependent on disease monitoring based on radiological reports. The purpose of the study was to compare non-standardized, common practice free text reporting (FTR) on disease response with reporting based on response evaluation criteria in solid tumors modified for immune-based therapeutics (iRECIST). Methods Fifty patients with advanced mRCC were included in the retrospective, single-center study. CT scans had been evaluated and FTR prepared in accordance with center’s routine practice. For study purposes, reports were re-evaluated using a dedicated computer program that applied iRECIST. Patients were followed up over a period of 22.8 ± 7.9 months in intervals of 2.7 ± 1.8 months. Weighted kappa statistics was run to assess strength of agreement. Logistic regression was used to identify predictors for different rating. Results Agreement between FTR and iRECIST-based reporting was moderate (kappa 0.38 [95% CI 0.2–0.6] to 0.70 [95% CI 0.5–0.9]). Tumor response or progression according to FTR were not confirmed with iRECIST in 19 (38%) or 11 (22%) patients, respectively, in at least one follow-up examination. With FTR, new lesions were frequently not recognized if they were already identified in the recent prior follow-up examination (odds ratio for too favorable rating of disease response compared to iRECIST: 5.4 [95% CI 2.9–10.1]. Conclusions Moderate agreement between disease response according to FTR or iRECIST in patients with mRCC suggests the need of standardized quantitative radiological assessment in daily clinical practice. Supplementary Information The online version contains supplementary material available at 10.1007/s00432-022-03997-0.


Introduction
The incidence of renal cell carcinoma up to age 75 is 4.4/100 000 worldwide (2.4% of new cancer cases, 1.7% of cancer deaths) (Ferlay et al. 2015) with higher incidence rates and declining mortality in developed countries (Znaor et al. 2015). For the survival of patients, it is crucial that radiological treatment monitoring provides substantial information for therapeutic decision making. However, a recent study suggests that the value of radiologic reports considerably depends on the methodical approach of assessment (Goebel et al. 2017).
Whilst in routine clinical practice, quantitative assessment criteria for morphologic CT image interpretation are rarely standardized and the method of free text reporting (FTR) is common, in clinical studies, consideration of standardized response evaluation criteria in solid tumors (RECIST) is required. With FTR, in general, no specific criteria for evaluation are defined. Radiologists routinely refer to the most previous finding. In contrast, RECIST based reporting refers either to baseline or to nadir. Moreover, with RECIST, selection of target lesions is specified. RECIST were updated in 2009 (validated RECIST 1.1 guideline) and are mainly to be applied in case of cytotoxic chemotherapy (Eisenhauer et al. 2009) whereas immune-RECIST (iRECIST consensus guideline) were proposed in 2017 and are supposed to be applied in patients who receive immunotherapy (Seymour et al. 2017). With immunotherapy, even initial progression due to infiltration of various immune cells into the tumor with subsequent reduction or stabilization of the tumor size (pseudoprogression) is associated with prolonged survival Aykan et al. 2020). Application of iRE-CIST for the assessment of metastatic renal cell carcinoma (mRCC) associated tumor burden, supported by semi-automated comparison using specified references (i.e.: baseline or nadir) may improve radiological assessment and reporting quality even in all day clinical practice.
The purpose of our study was to evaluate whether assessment of disease response in patients with mRCC according to common practice FTR without specified evaluation criteria sufficiently agrees with that based on software-aided application of iRECIST.

Study population
This retrospective, single-center study was approved by the institutional review board.
We searched the institutional medical database and included 50 consecutive patients with mRCC who had been treated with immunotherapy between January 2015 and October 2020. Patients with a measurable tumor burden at baseline according to current guidelines on tumor response criteria (RECIST 1.1 and iRECIST Eisenhauer et al. 2009;Seymour et al. 2017)), who had completed at least three follow-up examinations with contrast-enhanced CT of thorax, abdomen, and pelvis were eligible for inclusion.

CT data acquisition
CT scans were performed using Revolution CT (GE Healthcare) with a detector width of 160 mm. After intravenous administration of a 1 ml/kg body weight bolus of contrast agent (Accupaque 350 mg, GE Healthcare) followed by a 50 ml saline flush, the imaging started with a bolus-triggered technique (monitoring frequency from 10 s after contrast injection: 1 per second; trigger threshold: an increase of 100 HU in the descending aorta; delay from trigger to initiation of scan: 15 s). For CT scans, parameters were as follows: 120 kVp, automatically set mAs values, reconstructed to a slice thickness of 3 mm and a pitch of 53.

Image analysis
Based on morphological evaluation of CT scans, FTRs had been prepared as part of daily clinical practice by two radiologists in agreement. At least one of them had more than 5 years of experience concerning assessment of tumor burden. Overall, 58 radiologists had participated in preparation of the considered FTRs. FTRs covered both description of pathologies and clinical interpretation including indication of tumor development. Assessment criteria had not been specified. Tumor burden after a given treatment had usually been compared to the recent prior CT examination, however, reference CT was not mentioned explicitly. For study purposes, if not already done by the examiner, a resident physician of the radiology department retrospectively assigned interpretation of FTRs to one of the following four disease response categories: complete response (CR), partial response (PR), stable disease (SD), or progressive disease (PD).
In parallel, we imported the same image datasets into a commercially available semi-automatic software (mint lesion version 3.7.3, MINT Medical GmbH) (Goebel et al. 2017). A radiologist with 13 years of experience, blinded to FTRs, retrospectively selected target lesions for baseline entries. Classification of target lesions was based on specified criteria including size, number per organ system, and reproducible measurability according to RECIST 1.1 guidelines. Lesions which did not meet these criteria were classified as non-target lesions (Eisenhauer et al. 2009). Subsequently, the radiologist manually measured the longest/ shortest axis of each lesion with the aid of the software. The software automatically summed up the longest diameter of non-nodal target lesions and the short axis diameter of nodal target lesions. Lesions from follow-up CT scans were detected with support of the software and measured manually. The software again calculated the sum of lesion axis diameters at every follow-up to automatically compare it with the appropriate reference and assigned the respective disease category. According to iRECIST (Seymour et al. 2017), baseline CT serves as reference to determine response to treatment (iCR or iPR) and stable disease (iSD). Nadir (smallest sum of diameters so far) serves as reference to determine progression. There are two definitions of progress: unconfirmed progressive disease (iUPD) and confirmed progressive disease (iCPD: triggered by further progress after iUPD) (Online Resource: ESM Table 1). In addition, image analysis based on RECIST 1.1 was conducted to estimate to what extent pseudoprogression contributed to different assessment of tumor development.

Study outcome
In this study, unstandardized FTR on disease response was compared to software-supported assessment using iRE-CIST. The latter had been chosen as comparator because, in contrast to RECIST 1.1, iRECIST take effects of immunotherapy into account that may mimic tumor progression (pseudoprogression) (Seymour et al. 2017;Ma et al. 2019). Outcome of primary interest was strength of agreement regarding change in tumor burden according to FTR and iRECIST, quantified as weighted kappa. Secondary outcomes were proportionate agreement between the two approaches and associations of selected variables with different rating among FTR and iRECIST-based reports.

Statistical analysis
Categorical variables are presented as counts and percentages and continuous variables as means and standard deviations. Agreement was assessed with Cohen's kappa statistics. Amount of difference in rating was considered using weighted kappa. Strength of agreement was interpreted according to Landis and Koch (1977), (kappa ≤ 0.00: poor; 0.01-0.20: slight; 0.21-0.40: fair; 0.41-0.60: moderate; 0.61-0.80: substantial; 0.81-1.0: almost perfect agreement). Mann-Kendall test was used to determine whether agreement in tumor assessment had a trend over the series of follow-up examinations. Univariable analysis using logistic regression was applied to assess associations between selected variables with different rating of tumor burden with FTR or iRECIST. The p value threshold was adjusted for multiple testing (p < 0.006). Analysis was performed using StatsDirect statistical software version 2.8.0. (StatsDirect Ltd.) and XLSTAT version 2015.6.01.24026 (Addinsoft).

Study population
A total of 50 patients (64 ± 10 years, 68.0% male sex) with mRCC were included in the study. Prognostic risk according to international metastatic renal cell carcinoma database consortium (IMDC) assessment was poor in 20% of patients. Average number of target lesions per patient was 2.4 ± 1.4. The most common sites of target and nontarget lesions were lung (33.0%), lymph nodes (30.0%), kidney (7.1%), and liver (6.6%) ( Table 1). Patients completed 8.4 ± 2.3 CT follow-up evaluations over a period

Free text reporting versus iRECIST-based reporting
Rating of change in tumor burden according to either FTR or iRECIST differed in 30-57% of patients throughout followups. In 8-32% of patients, response category was assessed more favorable, and in 14-26% less favorable with FTR. Ratings differed by one (24-52% of patients) or two (0-6% of patients) categories (Fig. 1a).
With FTR, disease was preferably reported as SD at all follow-ups. Assessment of PR was 8-22 and PD was 10-32 percentage points less with FTR compared to iRECISTbased reports from the 3rd follow-up on. In contrast, with iRECIST, proportion of SD ratings declined from 76% at the 1st to 18% at the 10th follow-up. Overall, FTR on response or progression was not confirmed by iRECIST-based reports in 22% (21 of 96 CT evaluations, 19 patients) or 16% (12 of 76 CT evaluations, 11 patients), respectively (Fig. 2). Two of the patients in whom FTR indicated PD but iRECIST stated SD (2/11) underwent immediate change of treatment. However, change was due to diarrhea in one of them.

Predictors of different assessment
Univariable analysis revealed that first occurrence of new lesions decreased the odds of different assessment according to FTR compared to iRECIST (odds ratio [OR] 0.02 (95% CI 0.001-0.4). In contrast, the presence of already existing new lesions raised the odds that disease response was assessed as better according to FTR than to iRECIST by more than fivefold (OR 5.4 (95% CI 2.9-10.1). Both an increase in the sum of target lesion diameters and in the number of target lesions decreased the odds of disease response of being assessed as lower according to FTR compared to iRECIST (OR 0.9 per 10 mm [95% CI: 0.8-1.0] and OR 0.8 per lesion [95% CI 0.6-0.9], respectively). Involvement of lymph nodes nearly doubled the odds of different assessment (OR 1.77 (95% CI 1.1-2.7). Every month from baseline increased the odds that disease response was assessed as better according to FTR than to iRECIST (OR 1.0 per 30 days from baseline [95% CI 1.0-1.1]) (Fig. 4).

Discussion
We conducted a retrospective comparison of FTR with reporting based on software-supported application of iRE-CIST to assess change in tumor burden on grounds of anatomically unidimensional CT scan measurement in patients with mRCC. Agreement between both approaches was only moderate. According to unstandardized FTR, new lesions which were already present in recent prior follow-ups were frequently not recognized as such. Different assessment following FTR compared to iRECIST was more frequently seen if lymph nodes were target lesions.
In daily clinical practice, evaluation of tumor burden in addition to symptomatic criteria represents a crucial parameter for therapy control. Although, the iRECIST guideline basically does not apply for clinical decision making (Seymour et al. 2017) and criteria still need to be validated, standardized response criteria could facilitate objectivity of assessment even in daily practice.
A previous trial on tumors of various origins (Goebel et al. 2017) reported on fair to moderate agreement between FTR and RECIST 1.1 based reports. In most cases, different reporting with FTR was attributed to assignment of even minor changes in tumor burden to PR or PD instead of SD. a Non-clear cell RCC: papillary RCC, collecting duct RCC b IMDC prognostic risk was rated as follows: 0 risk factors, favorable; 1-2 risk factors, intermediate; ≥ 3 risk factors, poor (Motzer et al. 2019) c No cancer cells seen microscopically at the primary tumor site d Nuclear grading system based on microscopic morphology, which evaluates nuclear size, shape, and nucleolar prominence e T stage describes the size and extent of the primary tumor (Lam et al. 2009)  Another reason for discrepancies were comparisons to the most recent prior follow-up examinations instead to baseline or nadir. The latter was confirmed in our study. In addition, with FTR, new lesions that were already present were frequently not recognized as still existing new lesions, with the result of a too favorable rating. In case of lymph nodes, Fig. 1 Difference in assessment of response categories between FTR and iRECIST. Proportion of different assessment (a) and strength of agreement (b) regarding response category in patients with metastatic renal cell carcinoma with either FTR or iRECIST. Strength of agreement is presented as weighted kappa with 95% confidence interval.
Regarding iRECIST, unconfirmed progressive disease (iUPD) was put on the same level with confirmed progressive disease (iCPD). FTR, free text reporting; iRECIST, response evaluation criteria in solid tumors for immune-based therapeutics discrepancies might be based on lesion selection (iRECIST: a maximum of two nodes in total, even from different nodal basins) and measurement (iRECIST: at least 15 mm in short axis). Additionally, we found that the odds of too worse rating with FTR was reduced with increasing number and sum of diameters of target lesions. However, it seems conclusive that in advanced disease, worse rating is more frequently appropriate.
Assessment of pseudoprogression with iRECIST can only be determined with certainty after 4 weeks from the first detection of iUPD. After 4-8 weeks, disease could further progress and trigger iCPD or could stay Fig. 2 Distribution of disease response categories according to approach of reporting. CR complete response, iCPD immune confirmed progressive disease, iCR immune complete response, iPR immune partial response; iSD, immune stable disease, iUPD immune unconfirmed progressive disease, PD progressive disease, PR partial response, SD stable disease progredient compared to nadir (iUPD) or show stable disease (iSD), partial response (iPR), or complete response (iCR) compared to baseline. This new assessment should prevent patients to be withdrawn from immunotherapy in case of pseudoprogression. An earlier study found that continued immunotherapy beyond iUPD can prolong survival (George et al. 2016). Such a review of assessment is not automatically provided with FTR, however, could be visualized using longitudinal analysis and graphical methods (Shen et al. 2014). In our study, FTR investigators identified pseudoprogression in some of the patients with increased lymph node axis despite of decreased sum of parenchymal lesion diameters. Frequency of pseudoprogression, determined with iRECIST was similar to previous findings in patients with mRCC (9-14%) (Queirolo et al. 2017).

Fig. 3
Unconfirmed progression assessed with iRECIST in a 62-yearold male patient presenting with tumorous lymph nodes. At the 6th follow-up, target lesion response was rated as stable because short axis of the lymph node increased by ≥ 20% but by < 5 mm and thus did not meet criteria for progression. New lesion assessment was rated as iUPD because measure of short axis was still > 10 mm but did not increase by ≥ 5 mm from the initial detection. Thus, progressive disease was not confirmed, and overall response status was calculated as iUPD. B baseline, iUPD immune unconfirmed progressive disease, LA long axis, N nadir, P previous follow-up, SA short axis For unexperienced readers, even RECIST contains pitfalls that may origin in baseline lesion selection, reassessment of lesions, and identification of new lesions (Abramson et al. 2015;Keil et al. 2014) identified the choice of target lesions as major source of disagreement between readers, even with consideration of RECIST 1.1. Thus, target lesion selection probably remains a subjective confounder in the assessment of tumor development. Target lesions should be unequivocally metastases, but not postoperative seroma or granulation tissue. Lesions within a radiation field should not be selected as target lesions unless progression is demonstrated. At follow-up, lesions should be remeasured in the same phase of contrast. Finally, even in case of axis shift, it is required to remeasure the true long (parenchymal lesions) or short axis (nodal lesions), respectively (Abramson et al. 2015).
RECIST/iRECIST are well described and known in clinical practice, however, rarely implemented. This may be due to need for assignment of the appropriate reference CT which requires request for baseline information including treatment modalities from oncologists. Moreover, Fig. 4 Association of selected variables with assessment of disease response category depending on the approach of reporting. Based on univariable logistic regression a For lymph nodes, actual short axis measurement was included in the sum diameter reproducibility is supposed to increase when reference and follow-up CT scan is read by the same radiologist (Muenzel et al. 2012;Olthof et al. 2018). According to Krajewski et al. intra-rater variability in CT tumor size measurement ensures reproducibility of 10% tumor shrinkage measured by a single radiologist (Krajewski et al. 2014). However, this approach might be difficult to apply in small radiology departments. Thus, software-aided application of RECIST/ iRECIST, however, can be a sufficient and time-saving support to increase reproducibility of the reporting. From this one might conclude that the large number of radiologists who prepared FTRs might have contributed to the decreased agreement with iRECIST-based reports conducted by a single radiologist.
This study has some limitations. First, assessment criteria applied for FTR were not reported. In addition, assignment of FTR interpretation to disease categories for study purposes was not conducted by oncologists, the actual receivers, but by radiologists. Anyway, assignment left scope for interpretation. Furthermore, only a single reader conducted assessment according to iRECIST and we did neither systematically consider treatment decisions nor clinical outcomes. However, as observed in patients in whom FTR indicated progressive-but iRECIST found stable disease, treatment decision, was not necessarily associated with CT assessment. Finally, our small-scale trial was conducted retrospectively at a single center and thus, should be considered as exploratory.

Conclusions
In conclusion, utility of radiological assessment for treatment stratification increases with objectivity and traceability of reports. Standardized assessment criteria constitute a sound basis for quantitative morphologic evaluation of disease development in patients with mRCC. Additional text-only qualitative reports may be better suited to be seen by the patient (Travis et al. 2014). Due to unsatisfactory agreement between FTR and iRECIST-based reporting, we recommend extending application of iRECIST from clinical studies to routine clinical practice. Dedicated software may support implementation.