Introduction

Lung cancer is the second most prevalent malignancy worldwide, with approximately 2.2 million newly diagnosed cases in 2020 [1]. The majority are nonsmall cell lung cancer (NSCLC), which comprises nearly 84% of cases [2]. Stage I and II NSCLC are typically managed surgically, while treatment for locally advanced unresectable stage III and metastatic stage IV disease often necessitates adjuvant radiotherapy, frequently combined with chemotherapy and sometimes immunotherapy.

Despite therapeutic advancements, 5-year survival rates for stage III/IV have improved only marginally, from 24.6% in 2016 to 26.4% in 2020 [2]. Consequently, the research focus has shifted towards screening, diagnosis, and personalized management strategies to improve both quality of life and survival outcomes.

Radiomics is an emerging field that uses noninvasive techniques to extract radiomic features (RFs) from medical images, providing quantitative information beyond standard radiology reporting. RFs, often referred to as texture analysis, capture grey-level intensities and spatial relationships within the region of interest (ROI) in two-dimensional (2D) pixel and three-dimensional (3D) voxel spaces and are hypothesized to be associated with tissue heterogeneity and the tumor microenvironment [3,4,5,6,7]. A primary objective of radiomics is to provide predictive imaging biomarkers that, in conjunction with clinical parameters, could improve diagnosis and treatment prognostication, quality of life, and overall survival (OS), aligning with personalized and precision medicine goals.
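To illustrate how a texture RF encodes spatial relationships between grey levels, the following minimal Python sketch builds a grey-level co-occurrence matrix (GLCM) for a small 2D ROI and derives one feature (contrast) from it. The 4 × 4 ROI, the function names, and the single horizontal offset are illustrative assumptions. Radiomics software typically aggregates several directions and applies symmetrization and normalization, which this sketch omits.

```python
def glcm(image, levels, offset=(0, 1)):
    """Count how often grey level i has grey level j as its neighbour.

    `offset` is the (row, column) displacement to the neighbour; (0, 1)
    means "the pixel immediately to the right". The matrix is neither
    symmetrized nor normalized (a simplification for illustration).
    """
    d_row, d_col = offset
    matrix = [[0] * levels for _ in range(levels)]
    n_rows, n_cols = len(image), len(image[0])
    for r in range(n_rows):
        for c in range(n_cols):
            r2, c2 = r + d_row, c + d_col
            if 0 <= r2 < n_rows and 0 <= c2 < n_cols:
                matrix[image[r][c]][image[r2][c2]] += 1
    return matrix


def glcm_contrast(matrix):
    """Contrast: squared grey-level difference weighted by pair probability."""
    total = sum(sum(row) for row in matrix)
    return sum(
        ((i - j) ** 2) * count / total
        for i, row in enumerate(matrix)
        for j, count in enumerate(row)
    )


# Hypothetical ROI already discretized to 4 grey levels (0..3).
roi = [
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 2, 2, 2],
    [2, 2, 3, 3],
]

roi_glcm = glcm(roi, levels=4)
roi_contrast = glcm_contrast(roi_glcm)
```

A homogeneous ROI concentrates counts on the matrix diagonal and yields a contrast near zero, whereas a heterogeneous ROI spreads counts away from the diagonal and yields a higher value.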

Despite the substantial volume of NSCLC radiomics research, the translation into clinical practice has been constrained by technical and methodological challenges, resulting in studies with low statistical power and decreased replicability, reproducibility, and generalizability [3, 8,9,10,11,12,13]. Quality scoring tools and checklists, such as the Radiomics Quality Score (RQS) with 16 items and a maximum point score of 36, and the CheckList for EvaluAtion of Radiomics Research (CLEAR) with 58 items but without point-scoring, have been developed to address these challenges [10, 14]. However, their adoption has been limited, and concerns persist regarding their reliability in uniformly assessing the quality of radiomics research [9, 15].

Our study aims to 1) identify promising radiomics biomarkers in the literature on stage III/IV NSCLC treated with radiation and 2) critically appraise the research pipeline using the recently published CLEAR and longer-existing RQS systems. In addition, we merge the wording of both the CLEAR and RQS frameworks into a comprehensive checklist (CLEAR-RQS), allowing a comparison of CLEAR-RQS point-scoring against CLEAR and RQS [9, 10]. CLEAR-RQS aims to serve as a valuable resource to radiomics researchers and educators across various disciplines.

Materials and methods

IRB approval was not required for this research since it does not include any human subjects or identifiable private information.

Objective 1: PRISMA literature search to identify radiomics studies in stage III/IV NSCLC patients treated with radiotherapy

We conducted a literature search of the online databases MEDLINE, PubMed, and SCOPUS from June to August 2023. The search fields comprised [Stage III NSCLC OR Stage IV NSCLC OR nonsmall cell lung cancer] AND [radiotherapy OR SABR (stereotactic ablative body radiation) OR SBRT (stereotactic body radiation therapy)] AND [CT radiomic OR [quantitative AND imaging] OR [texture AND feature]]. Initial title and abstract analyses were performed by K.T. (3rd-year graduate medical student), with subsequent full-text screening assessment by K.T. and H.S.K. (radiologist with 20 years of general and 16 years of oncological imaging subspecialty knowledge). The final article selection comprised original research in human studies, written in the English language, on CT radiomics in post-radiotherapy stage III/IV NSCLC (Table 1). Figure 1 shows the PRISMA flow diagram of the literature search.

Table 1 PRISMA literature search
Fig. 1 PRISMA flow diagram of PubMed, MEDLINE, and SCOPUS literature search

Literature data extraction and analysis

Article data extraction included cohort size, radiotherapy/CT technique, utilized radiomics software, selected RFs, and study endpoints.

Critical appraisal of full-text articles addressed the following research questions: 1) are there commonly selected RFs for treatment response, adverse events, and/or outcomes in patients undergoing radiotherapy? 2) are there factors within the research study design that would impede reproducibility?

Objective 2: critical appraisal of selected articles applying CLEAR and RQS frameworks and development of a comprehensive radiomics assessment checklist (CLEAR-RQS)

All articles were assessed by three readers, D.G. (radiologist with 4 years of general radiology experience), K.T., and H.S.K., utilizing the RQS metrics and the CLEAR/CLEAR-RQS criteria [10, 14]. To facilitate a direct comparison between RQS and CLEAR/CLEAR-RQS, a point score of 1 for “yes” and of 0 for “no” or “NA” responses was assigned to each CLEAR/CLEAR-RQS item, resulting in a maximal possible score of 58 for CLEAR and 61 for CLEAR-RQS.

The mean score from all three readers was utilized to compare the RQS, CLEAR, and CLEAR-RQS frameworks. To enable a relative comparison between frameworks, the score of each tool was proportionally converted to a percentage based on its metric (e.g., 100% equated to a CLEAR point score of 58, a CLEAR-RQS score of 61, and an RQS score of 36).
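The scoring and normalization steps described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions: the three readers' responses are invented placeholders, and the function and variable names are hypothetical rather than taken from the study's analysis.

```python
from statistics import mean

# Maximum achievable points per framework, as stated in the text.
MAX_POINTS = {"CLEAR": 58, "CLEAR-RQS": 61, "RQS": 36}


def checklist_points(responses):
    """One reader's checklist score: 1 point per "yes", 0 for "no" or "NA"."""
    return sum(1 for answer in responses if answer.strip().lower() == "yes")


def percent_of_max(points, framework):
    """Express a point score as a percentage of the framework maximum."""
    return 100.0 * points / MAX_POINTS[framework]


# Hypothetical CLEAR responses (58 items each) from three readers.
reader_a = ["yes"] * 30 + ["no"] * 20 + ["NA"] * 8
reader_b = ["yes"] * 33 + ["no"] * 19 + ["NA"] * 6
reader_c = ["yes"] * 36 + ["no"] * 16 + ["NA"] * 6

reader_points = [checklist_points(r) for r in (reader_a, reader_b, reader_c)]
mean_points = mean(reader_points)                    # mean across readers
clear_percent = percent_of_max(mean_points, "CLEAR")

print(reader_points, mean_points, round(clear_percent, 2))
```

Expressing every score as a percentage of its own maximum is what allows a 36-point RQS total and a 58-point CLEAR total to be placed on the same axis.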

K.T. and H.S.K. systematically compared the wording and interpretation of all 58 CLEAR and 16 RQS items (Table 2). To prevent redundancy, identical and very similar items were merged, retaining the wording of the more specific source framework (CLEAR or RQS). No new wording was introduced to ensure adherence to the respective source framework.

Table 2 CLEAR-RQS checklist

Results

Objective 1: PRISMA literature search

Figure 1 demonstrates the PRISMA diagram, which outlines the literature search. In total, 871 articles were found (PubMed n = 403, MEDLINE n = 249, SCOPUS n = 219). After the exclusion of 462 duplicates, 409 article abstracts were screened. This resulted in 22 identified articles that underwent full-text assessment, of which a further 11 were excluded based on inclusion and exclusion criteria (Table 1). Finally, 11 articles were included in the systematic review (Supplemental Table S1).
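The counts in the PRISMA flow can be cross-checked with simple arithmetic, sketched below in Python. All numbers are those reported in the text, except the number of records excluded at abstract screening, which is derived here as an assumption (screened minus full-text) and is not stated explicitly.

```python
# Records identified per database, as reported in the text.
identified = {"PubMed": 403, "MEDLINE": 249, "SCOPUS": 219}
total_identified = sum(identified.values())

duplicates_removed = 462
abstracts_screened = total_identified - duplicates_removed

full_text_assessed = 22
full_text_excluded = 11
included_in_review = full_text_assessed - full_text_excluded

# Derived value (assumption): records excluded at title/abstract screening.
excluded_at_screening = abstracts_screened - full_text_assessed

print(total_identified, abstracts_screened, included_in_review)
```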

Cohort specifics

Retrospective patient cohort sizes ranged from 10 to 337 (median = 91, mean = 114), with 7 studies comprising cohorts of fewer than 100 patients [11, 16,17,18,19,20,21]. All studies except for 2 analyzed single-center patient cohorts [11, 22].

Study endpoints of selected radiomic features

Study endpoints varied with selected RFs relating to OS in three studies [17, 23, 24] and to treatment response in two studies [19, 25]. Three studies analyzed both OS and progression-free survival [11, 21, 22], and two studies examined the treatment-related complication of radiation pneumonitis [18, 20]. One study measured RF changes in the NSCLC tumor before and during radiotherapy without association with any clinical endpoint [16].

Radiotherapy regimen

Applied radiotherapy methods varied, with intensity-modulated radiotherapy (IMRT) utilized in three studies, IMRT or stereotactic body radiotherapy in one study, and volumetric modulated arc therapy (VMAT) in two studies [16, 18, 21, 23, 25, 26]. One study employed stereotactic ablative radiotherapy (SBRT) in a subset of its patient cohort [23]. Four studies did not mention specific radiotherapy delivery methods [17, 19, 20, 22].

CT imaging protocol

CT vendor/scanner type and scanning technique varied or were not disclosed in multiple aspects.

Regarding CT vendor and scanner models, 6 out of 11 articles mentioned the scanner model, and of these 6, 5 used a single CT scanner model.

Two studies used noncontrast cone beam CT images [17, 22].

Three studies used contrast-enhanced CT images [11, 21, 24], and the remaining 6 studies did not mention specific contrast phases [16, 18,19,20, 23, 25].

Three studies specified the respiratory cycle timepoint of image acquisition, with 2 at free breathing cycles [18, 20] and 1 at the end-expiratory phase [21].

Three studies did not specify the CT slice thickness [19, 24, 25], and 4 studies reported a CT slice thickness of 2.5 mm [11, 16, 20, 22].

One study each analyzed 1 or 2 mm [23], 1 or 3 mm [17], 2.5 or 3.0 mm [21], and 5 mm [18] CT slice thicknesses, respectively.

Radiomic feature extraction

RF extraction software was highly variable among the studies. Nine studies extracted features utilizing common software tools (1 AnalysisKit [23], 2 PyRadiomics [18, 20], 2 IBEX [16, 17], 3 MATLAB [11, 19, 22], 1 LIFEx [25]). One study employed in-house software to extract radiomic features [21], and 1 study did not disclose the utilized software [5].

Radiomic feature selection

Full-text analysis did not identify RFs common across studies, given the variability of study endpoints (e.g., treatment response, OS, radiotherapy-related pneumonitis) and the differing data sets. The selected RFs described included Grey-Level Co-occurrence Matrix (GLCM) features [11, 16,17,18,19,20, 22, 24, 25], first-order intensity [16, 17, 20, 22, 23] and shape [17, 20, 22, 23] RFs, and higher-order Grey-Level Size Zone Matrix (GLSZM) RFs [18, 23, 25].

Model building

Model or nomogram building with non-RF parameters was described in 8 out of 11 studies [11, 18, 19, 21,22,23,24,25]. Available model/nomogram performance varied, with three studies demonstrating borderline significant p values of 0.048, 0.049, and 0.046, respectively [11, 21, 23]. The most commonly utilized clinicopathological parameters for model building were smoking [18, 19, 21, 25] and T- and N-stage [19, 21, 22, 25], each observed in four studies, followed by tumor histology, incorporated in three studies [19, 21, 22].

Supplemental Table S1 describes the articles’ detailed data extraction.

Objective 2: applying CLEAR and RQS point-scoring to selected articles (n = 11) and development of a comprehensive radiomics assessment checklist (CLEAR-RQS)

CLEAR metrics

The median CLEAR point score was 32.33 (55.74%, range: 25.33–48 [47.7–82.75%]). Across all three readers, all studies fulfilled the “manuscript preparation” CLEAR criteria of providing a title, abstract, keywords, introduction, and discussion. All articles failed to report details regarding the items “sample size calculation” and “flowchart for eligibility criteria”, and the entire domain of “open science.”

Table 3 summarizes the 44 items in detail where two or all three readers identified missing data pertaining to the respective CLEAR item.

Table 3 Missing data on CLEAR framework

RQS metrics

The median RQS point score was 6.33 (17.59%), with a range of 0–16 points (0–44.44%) out of a maximal possible score of 36 points. Many criteria were scored 0 or below by all readers, as illustrated in Table 4; e.g., no study contained “phantom calibrations”, was a “prospective study registered with a database”, or performed a “cost-effective analysis”.

Table 4 Missing data on RQS framework

Comparing CLEAR and RQS point distribution

Table 5 demonstrates the point distribution for papers evaluated using the CLEAR and RQS criteria. Ranking differed for the top 3 articles between the CLEAR and RQS systems. For example, Chen et al [23] ranked 1st on the RQS but 4th according to CLEAR metrics, whereas Van Timmeren et al [22] ranked 1st on the CLEAR but 2nd according to the RQS framework.

Table 5 RQS and CLEAR scores and rankings

Figure 2 shows the score point values and respective ranking of appraised articles according to the CLEAR and RQS metrics.

Fig. 2 RQS and CLEAR percentage score distributions of assessed radiomics articles in post-radiotherapy stage III/IV NSCLC (n = 11). Red bars represent the RQS framework and green bars the CLEAR framework. Numbers on top of the bars represent the RQS and CLEAR rank, respectively. The horizontal red bar delineates 50%, highlighting that no RQS score was above 50%. Articles are listed in alphabetical order

Amalgamation of CLEAR and RQS items into a comprehensive assessment checklist (CLEAR-RQS) and comparing CLEAR-RQS with CLEAR and RQS

The wording of the 58 CLEAR and 16 RQS items was compared, and identical or similar items were merged, resulting in the development of a 61-item CLEAR-RQS checklist (Table 2).

When applying the newly developed CLEAR-RQS checklist, the scoring percentage of each article fell between its CLEAR and RQS scores, with CLEAR-RQS adhering more closely to the CLEAR checklist (Supplemental Fig. S1). This is explained by the CLEAR-RQS checklist containing 61 items, which aligns much more closely with the 58-item CLEAR checklist than with the RQS framework, which contains only 16 items.

Discussion

This systematic literature review on radiomic features in post-radiotherapy stage III/IV NSCLC patients yielded 11 retrospective studies with substantial variations in study design, which rendered them incomparable and precluded the identification of an RF suitable for clinical translation. Moreover, reporting quality was low when applying both the CLEAR and RQS frameworks, consistent with findings from other radiomics data reviews and meta-analyses [8, 15, 27, 28]. Merging the CLEAR and RQS frameworks into the CLEAR-RQS checklist aimed to provide the radiomics research community with a comprehensive yet detailed guide for designing research and critically appraising published research.

Limitations in radiomics study design

This review revealed several shortcomings in research design, potentially diminishing the generalizability and reproducibility of identified RFs.

The heterogeneity of study cohorts and relatively small sample sizes may limit comparability. Notably, two studies featured small sample sizes (n = 10, n = 23), rendering validation nearly unfeasible [16, 17].

Data harmonization, particularly of image acquisition and reconstruction settings (referred to as “pre-processing” by CLEAR and RQS), emerged as a key requirement in radiomics research [29, 30]. Three studies did not disclose whether CT slice thickness harmonization was performed [19, 24, 25]. Body habitus, scanner models, and demographic parameters may influence radiomic analysis, necessitating their specification for future validation [30]. This may require further data postprocessing to ensure reproducibility [29]. Two studies [17, 22] used cone-beam CT (CBCT) images, introducing challenges related to radiomic region-of-interest delineation caused by scattered radiation artifacts [31, 32]. Only three studies detailed the use of free breathing CT images [17, 18, 20], with the remaining studies neglecting to specify the CT acquisition breathing cycle point [11, 16, 19, 21,22,23,24,25]. Free-breathing studies introduce image blurring due to movement artifacts, which is acknowledged to impact radiomics analysis [33]. Consequently, RF extraction from inherently inconsistent or highly variable CT scanning protocols may compromise result interpretation and reproducibility.

Seven studies omitted reporting of image pre-processing resampling techniques and associated parameters [11, 16,17,18,19, 24, 25]. Eight studies failed to describe discretization methods [11, 16,17,18,19,20, 24, 25]. Image resampling, particularly downsampling and interpolating images in a manner that preserves spatial detail while avoiding overfitting, is critical for data harmonization. Shafiq-ul-Hassan et al demonstrated that resampling could reduce feature variability, therefore enhancing RF robustness [34].
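Because discretization settings change the grey levels from which RFs are computed, a minimal Python sketch of two widely used grey-level discretization schemes (fixed bin width and fixed bin number) may help show which parameters need to be reported. The Hounsfield unit (HU) values, bin settings, and function names are illustrative assumptions and do not describe the method of any reviewed study.

```python
import math


def discretize_fixed_bin_width(values, bin_width, minimum=None):
    """Assign bin 1 to [minimum, minimum + bin_width), bin 2 to the next, etc.

    If `minimum` is not given, the lowest intensity in `values` is used.
    """
    lowest = min(values) if minimum is None else minimum
    return [int(math.floor((v - lowest) / bin_width)) + 1 for v in values]


def discretize_fixed_bin_number(values, n_bins):
    """Split the observed intensity range into `n_bins` equally wide bins."""
    lowest, highest = min(values), max(values)
    if highest == lowest:
        return [1] * len(values)
    return [
        # The maximum intensity would land in bin n_bins + 1, so cap it.
        min(int(math.floor(n_bins * (v - lowest) / (highest - lowest))) + 1, n_bins)
        for v in values
    ]


# Hypothetical HU intensities sampled from an ROI.
hu_values = [-50, -25, 0, 24, 25, 100]

fixed_width = discretize_fixed_bin_width(hu_values, bin_width=25, minimum=-50)
fixed_number = discretize_fixed_bin_number(hu_values, n_bins=4)

print(fixed_width)
print(fixed_number)
```

The same intensities map to different grey levels under the two schemes, so texture RFs computed from them are not directly comparable unless the scheme and its parameters are reported.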

Only 2 studies reported details of a segmentation reliability analysis for feature extraction [11, 21]. Describing this step is important, as manual or semi-automated segmentation methods may introduce intra- and inter-observer variability, impacting reproducibility [35].

Certain categories of RFs, including first-order (intensity, shape) and higher-order (GLCM (Grey-Level Co-Occurrence Matrix), GLSZM (Grey-Level Size Zone Matrix)) groups, were more commonly investigated [11, 16,17,18,19,20, 22, 24, 25].

CLEAR and RQS metrics to assess the quality of radiomics research reporting

Item weighting

Assessing study quality depends on robust research design and comprehensive reporting of methodology, statistical parameters, and results. Both CLEAR criteria and RQS scores indicated suboptimal reporting quality, with variations in study rankings. No study fully met all CLEAR items, and RQS scores ranged from 0 to +16 points, less than 50% of the maximum achievable +36 points. Our analysis suggests that these assessment tools offer complementary critiques for identifying methodological challenges hindering the reproducibility and clinical application of radiomic results.

The CLEAR checklist offers a general guideline covering all aspects of the radiomics workflow, while the RQS framework comprises 16 criteria with varying weighted point scores. Certain domains, such as “prospective validation in an appropriate trial” (0 or +7 points) and “validation cohorts” (-5, +2, +3, +4, +5 points), are assigned more points than others. These items contributed most to top-scoring papers on RQS, which did not align with their CLEAR ranking. For instance, the RQS item “validation” negatively impacted the scores of Yang et al (3.67 points) [18], Shi et al (0.67 points) [17], and Zhang et al (0.67 points) [16], ranking them 7th, 10th, and 11th out of 11 articles, respectively. Such large point score disparities were not observed with CLEAR criteria, as exemplified by the comparison of Wang et al and Fried et al [21, 24]. With RQS, Wang et al ranked 5th (12.67 points) while Fried et al ranked 6th (6.33 points), whereas in the CLEAR metric the point-score disparity was less evident, with Wang et al ranking lower (rank 7, 31.33 points) than Fried et al (rank 5, 35.67 points) [21, 24].
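The effect of item weighting can be made concrete with a short Python sketch that computes the largest share of the total score a single item can move. The RQS point options and the maxima of 36 and 58 are taken from the text, and the 0/1 CLEAR scoring follows the scheme applied in this study; the function and variable names are hypothetical.

```python
RQS_MAX_POINTS = 36
CLEAR_MAX_POINTS = 58

# Point options of the two heavily weighted RQS items named in the text.
rqs_item_options = {
    "validation cohorts": [-5, 2, 3, 4, 5],
    "prospective validation in an appropriate trial": [0, 7],
}

# Each CLEAR item was scored 0 ("no"/"NA") or 1 ("yes") in this study.
clear_item_options = [0, 1]


def max_swing_percent(options, max_points):
    """Largest change in the percentage score attributable to one item."""
    return 100.0 * (max(options) - min(options)) / max_points


rqs_swings = {
    name: max_swing_percent(options, RQS_MAX_POINTS)
    for name, options in rqs_item_options.items()
}
clear_swing = max_swing_percent(clear_item_options, CLEAR_MAX_POINTS)

for name, swing in rqs_swings.items():
    print(f"RQS {name}: up to {swing:.1f}% of the total score")
print(f"Any single CLEAR item: up to {clear_swing:.1f}% of the total score")
```

Under these assumptions, one RQS validation item can shift an article's percentage score by more than a quarter of the scale, whereas any single CLEAR item shifts it by under 2%, which is consistent with the larger rank disparities observed with RQS.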

A recently published quality scoring tool for radiomics research, METRICS (METhodological RadiomICs Score), has been developed by an international panel and has been endorsed by the European Society of Medical Imaging Informatics (EUSoMII). METRICS contains weighted items carefully selected and discussed via a modified Delphi process to ensure a balanced consensus among panelists [36]. This new point-scoring framework aims to facilitate critical appraisal of a broad range of radiomics research, from the manual data labeling and extraction to deep learning artificial intelligence (AI) pipelines.

Inter-rater variability

D’Antonoli et al’s study revealed that the RQS metric is susceptible to inter-rater biases, as its domains can be construed differently depending on raters’ backgrounds [9]. This corresponds to our findings, as our three raters – a graduate medical student, a junior radiologist, and a senior radiologist – exhibited minor discrepancies in RQS scores, which were reconciled through consensus. This variability aligns with prior research indicating low RQS scores and poor inter-rater reliability [9, 27, 28].

Creating a comprehensive CLEAR-RQS checklist to aid future education and research

Efforts aim to develop a robust tool for assessing radiomics research quality, with a focus on machine learning and other AI models [37,38,39]. The RQS and CLEAR frameworks specifically address radiomics methodology [10, 14], which has garnered attention from the Society of Nuclear Medicine and Molecular Imaging, the European Association of Nuclear Medicine [39], and the Scientific Editorial Board of European Radiology [40].

The CLEAR-RQS checklist presented here, developed by an international research group from two academic tertiary institutions, aims to comprehensively evaluate radiomics methodologies without sacrificing specificity. It integrates standards from both the CLEAR and RQS tools, preserving their detailed wording, which caters to radiomics researchers, while also serving educational purposes across various disciplines. A point-scoring system should not be applied to the CLEAR-RQS checklist: given the intricate complexities inherent in real-world research scenarios, point scores may not be granular enough to adequately capture the nuanced quality of the assessed research investigations.

In conclusion, stage III/IV NSCLC radiomics research suffers from suboptimal reporting quality, hindering the discovery of validated predictive RFs. Technical challenges and lack of access to source images and model files impede reproducibility. Thorough validation and open access to data and code are essential to increase transparency and raise reporting standards [41, 42]. Adoption of the CLEAR-RQS checklist could accelerate the translation of radiomics research into clinical practice. Furthermore, sustained multi-disciplinary collaboration for continuous assessment and improvement in this rapidly evolving field is required to ultimately benefit patient outcomes in personalized medicine.