Introduction

Radiomics is an analysis technique for extracting information from medical images that may not be perceptible to the naked eye [1]. Over the course of a decade, several thousand radiomics studies spanning diverse imaging disciplines have been published [2]. Nevertheless, the inherent complexity of the advanced methods employed to extract quantitative radiomics features may make it difficult to understand all facets of the analysis and to evaluate research quality, let alone to implement the published techniques in the clinical setting [3]. It is evident that easily applicable and robust tools for assessing the quality of radiomics research are needed to move the field forward.

With the aim of improving the quality of radiomics research methods, Lambin et al [4] proposed an assessment tool, the radiomics quality score (RQS), in 2017. The RQS follows the ideal workflow of a radiomics study, breaking it down into several steps and aiming to standardize them. As a result, the RQS comprises 16 items covering the entire lifecycle of radiomics research. Since its introduction, it has been widely adopted by the radiomics research community, and numerous systematic reviews using this assessment tool have been published [5,6,7,8,9]. However, it can still be challenging for researchers or reviewers to correctly interpret and apply the RQS and, therefore, to assign reproducible scores; as a result, in these systematic reviews the RQS scores are most often determined by consensus, without a reproducibility analysis [5,6,7, 10,11,12,13]. Importantly, no intra- or inter-rater reproducibility analysis was presented in the original RQS publication [4].

According to a recent review of systematic reviews using the RQS, the RQS is mostly applied in a consensus approach: 27 of 44 review articles used consensus scoring, 10 did not specify how the final scores were obtained, and only 7 used intraclass correlation coefficients (ICC) or kappa (k) statistics to assess inter-rater reliability [5]. Despite the positive connotation of a consensus decision, a score reached by consensus is not necessarily reproducible. A consensus decision might solely reflect the most experienced rater, as novice voices can be suppressed, resulting in an underestimation of disagreement [14]. The preference for consensus over inter-rater reliability analysis may also stem from challenges in applying the RQS and from ratings that cannot be reliably reproduced across raters. Evidently, there is room for improvement in establishing an easily usable and reproducible tool for all researchers.

In this study, we aimed to perform a large multireader analysis of the intra- and inter-rater reliability of the total RQS score and of the individual RQS items. We believe that a robust method for assessing the quality of radiomics research is essential to carry the field into the future of radiology rather than ushering in a reproducibility crisis.

Material and methods

The study was conducted in adherence to the Guidelines for Reporting Reliability and Agreement Studies (GRRAS) reporting guidelines [15].

Paper selection

We included studies recently published in European Radiology, within an arbitrarily chosen period of 1 month up to the start of our study. The following search query was used: (“European Radiology”[Journal]) AND (“radiomics”[Title/Abstract] OR “radiomic”[Title/Abstract]) AND (2022/09/01:2022/10/20[Date - Publication]). European Radiology was selected because it is a first-quartile (Q1, Scimago Journal Rank) journal with the highest number of radiomics publications among all radiology journals; e.g., a PubMed search with the keyword “radiomics” or “radiomic” in the article title/abstract returns 249 original radiomics articles between January 1, 2021, and December 31, 2022 (Fig. 1).

Fig. 1

Bar graphs show the number of original radiomics articles published in first-quartile general radiology journals between 2021 and 2022
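For transparency, the PubMed query above can also be run programmatically. The following is a minimal sketch using the rentrez package; it was not part of our selection workflow, and the hit count returned will depend on the date on which the query is executed.

```r
# Minimal sketch (not part of the original selection workflow): running the
# PubMed query with the "rentrez" package.
library(rentrez)

query <- paste(
  '"European Radiology"[Journal]',
  'AND ("radiomics"[Title/Abstract] OR "radiomic"[Title/Abstract])',
  'AND (2022/09/01:2022/10/20[Date - Publication])'
)

res <- entrez_search(db = "pubmed", term = query, retmax = 200)
res$count     # number of matching records
head(res$ids) # PubMed IDs of the retrieved articles
```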

We only included original research articles and excluded systematic reviews, literature reviews, editorials, letters, and corrections. After applying the inclusion and exclusion criteria, a total of 33 articles were selected, exceeding the minimum sample size of 30 recommended for inter-rater reliability studies by the Guideline for Selecting and Reporting Intraclass Correlation Coefficients for Reliability Research (Fig. 2) [16].

Fig. 2

Study flow

Rater selection and raters’ survey

A total of 9 raters with different backgrounds and experience levels were recruited for the study through an open call within the European Society of Medical Imaging Informatics (EuSoMII) Radiomics Auditing Group. All raters initially completed a survey, sent by email, to determine their level of expertise in applying the RQS as well as their level of professional experience. They were then randomly assigned to the following groups according to their level of expertise: two inter-rater reliability groups, one with and one without a training session on the use of the RQS, and one intra-rater reliability group (Table 1).

Table 1 Rater characteristics according to the level of RQS rating experience

The inter-rater reliability group with training (group 1) received a brief training session on the RQS assessment, during which an experienced rater (T.A.D.) instructed them on how to rate all items using a randomly chosen article [17]; they then independently completed the assessment of all 33 papers. The inter-rater reliability group without training (group 2) received no training on the RQS and completed the ratings of all 33 papers. The intra-rater reliability group (group 3) received no training and was asked to score 17 of the 33 selected papers twice, 1 month apart, to minimize recall bias (Fig. 3). All raters provided their ratings while reading the articles and any available supplementary material. A keyword search was also allowed if needed.

Fig. 3

Study pipeline showing the different groups and their pathways in the study

At the end of the study, raters received another survey to investigate the challenges they faced during the RQS assessment and their possible solutions.

Statistical analysis

We used the ICC (two-way, single-rater, absolute-agreement, random-effects model) for continuous variables, i.e., the total RQS, and Fleiss’ and Cohen’s k statistics for categorical variables, i.e., the item scores, as recommended [15, 16, 18]. Cohen’s k does not support comparisons of more than two raters/ratings; Fleiss’ k should be used instead when more than two raters/ratings are compared [19]. Therefore, Cohen’s kappa was used when there were two ratings/raters (group 3) and Fleiss’ kappa when there were more than two ratings/raters (groups 1 and 2) [19]. We used two one-sided t-tests (TOST), a test of equivalence based on the classical t-test, to investigate differences between the mean RQS scores of the groups [20]. All statistical analyses were carried out with R (version 4.1.1) using the “irr” and “TOSTER” packages [21].
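As an illustration of these analyses, the following is a minimal, self-contained sketch using the "irr" package; the simulated matrices are purely hypothetical stand-ins for our rating data and are included only so that the code runs as written.

```r
# Minimal sketch of the reliability computations with the "irr" package;
# the simulated data below are illustrative, not the study data.
library(irr)

set.seed(1)
# Total RQS of 33 papers scored by 3 raters (continuous outcome)
rqs_total <- matrix(sample(5:25, 33 * 3, replace = TRUE), ncol = 3,
                    dimnames = list(NULL, paste0("rater", 1:3)))
# One categorical RQS item scored by the same 3 raters
item_scores <- matrix(sample(0:1, 33 * 3, replace = TRUE), ncol = 3)

# ICC for total RQS: two-way random-effects model, single rater, absolute agreement
icc(rqs_total, model = "twoway", type = "agreement", unit = "single")

# Fleiss' kappa when more than two raters score an item (groups 1 and 2)
kappam.fleiss(item_scores)

# Cohen's kappa for two ratings of the same item (group 3, read 1 vs read 2)
kappa2(item_scores[, 1:2], weight = "unweighted")
```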

Results

Paper selection

A total of 33 papers were included in this study. Two papers were technical papers, i.e., phantom studies, and all others were original research articles. The characteristics of included studies are shown in Table 2.

Table 2 Characteristics of included papers

Rater selection and raters’ survey

Raters were randomly assigned to groups based on the initial survey results (Table 1). After completing the assessments, raters were given another survey to explore the challenges they faced during the RQS assessment and their possible solutions. All responses can be found in Table E1. One of the main problems they faced was the confusion caused by the lack of clear explanations of the RQS items in the main RQS paper and in the RQS checklist [4]. A list of the major issues with RQS along with our recommendations for a simpler approach is presented in Table 3.

Table 3 Potential reasons for challenges and proposed amendments for the radiomics quality score, broken down by item

Statistical analysis

Inter-rater reliability

The inter-rater reliability was poor between the raters of group 1 (ICC 0.30; 95% CI [0.09–0.52]; p = 0.0015) and moderate between the raters of group 2 (ICC 0.55; 95% CI [0.29–0.74]; p < 0.001), and it remained low to moderate when comparing raters of groups 1 and 2 with the same level of experience (ICC 0.26–0.61). This trend was also observed in the intra-group reliability analysis: raters of group 1 showed poor and raters of group 2 moderate inter-rater reliability (Table 4).

Table 4 Results of the intra- and inter-rater reliability analysis for overall RQS

Intra-rater reliability

In the intra-rater reliability analysis, only rater 3, with an intermediate experience level, showed moderate reliability between the first and second read (ICC 0.522; 95% CI [0.09–0.79]; p = 0.009), whereas rater 6 and rater 9, with an advanced experience level, showed excellent intra-rater reliability (ICC 0.91; 95% CI [0.77–0.96]; p < 0.001 and ICC 0.99; 95% CI [0.96–0.99]; p < 0.001, respectively).

Reliability of RQS items’ score

The inter-rater reliability of the RQS item scores within groups 1 and 2 was very low. The only items with high inter-rater reliability were item 3 (phantom study) and item 15 (cost-effectiveness analysis); all other items had poor to moderate inter-rater reliability. The intra-rater reliability of the RQS item scores was higher, with most items showing moderate to good, if not perfect, reliability. The mean ± standard deviation of the k values was 0.18 ± 0.33 for group 1 and 0.43 ± 0.30 for group 2; within group 3, it was 0.70 ± 0.30 for rater 3, 0.75 ± 0.22 for rater 6, and 0.88 ± 0.27 for rater 9. Fleiss’ k for each RQS item in groups 1 and 2 and Cohen’s k for each RQS item in group 3 are summarized in Table 5.

Table 5 Results of the intra- and inter-rater reliability analysis for RQS item reproducibility
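To illustrate how a per-item summary of this kind can be assembled, the sketch below computes Fleiss’ kappa for each item over a list of per-item rating matrices; the simulated list item_ratings is a hypothetical stand-in for the actual rating tables of a group.

```r
# Illustrative sketch: mean and standard deviation of per-item Fleiss' kappa
# values, as reported for groups 1 and 2; the simulated data are hypothetical.
library(irr)

set.seed(2)
# One 33 (papers) x 3 (raters) score matrix per RQS item, 16 items in total
item_ratings <- replicate(
  16,
  matrix(sample(0:2, 33 * 3, replace = TRUE), ncol = 3),
  simplify = FALSE
)

item_kappas <- sapply(item_ratings, function(m) kappam.fleiss(m)$value)
round(c(mean = mean(item_kappas), sd = sd(item_kappas)), 2)
```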

Moreover, we found that two of the 33 manuscripts included a self-reported RQS that was higher than the scores assigned by the raters in our study, as reported in Table 3 [51, 52].

The mean RQS was 10.2 ± 3.5 for group 1 and 13.2 ± 4.0 for group 2; in group 3, it was 12.23 ± 5.0 at the first read and 12.4 ± 4.9 at the second read (Fig. 4). Two one-sided t-tests were applied to the mean RQS values obtained by the readers of groups 1 and 2. The lower and upper equivalence bounds were calculated to provide a statistical power of 0.8 with an alpha of 0.05. Thus, with lower and upper equivalence bounds of ±2.6 and a mean difference of −3.1, the p value was 0.7 for the lower bound and < 0.001 for the upper bound (Fig. 5).
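For concreteness, the equivalence test can be reproduced from the reported summary statistics as in the minimal sketch below; treating the 33 papers as the unit of analysis (n = 33 per group) is an assumption made for illustration only.

```r
# Minimal sketch of the TOST equivalence test between groups 1 and 2 using the
# reported means and standard deviations; n = 33 per group is an assumption.
library(TOSTER)

TOSTtwo.raw(m1 = 10.2, sd1 = 3.5, n1 = 33,   # group 1 (with training)
            m2 = 13.2, sd2 = 4.0, n2 = 33,   # group 2 (without training)
            low_eqbound = -2.6, high_eqbound = 2.6,
            alpha = 0.05)
# Note: newer TOSTER releases deprecate TOSTtwo.raw() in favour of tsum_TOST().
```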

Fig. 4

Histograms and kernel density estimation plots showing the overall distribution of mean RQS separately (a) in group 1 (depicted in blue) and group 2 (depicted in orange) and (b) in group 3 first read (depicted in blue) and second read (depicted in orange)

Fig. 5

Two one-sided t-test graph

Discussion

In this study, we conducted a multireader analysis of the intra- and inter-rater reliability of the total RQS as well as of the individual RQS item scores, involving readers with different levels of RQS rating experience. We found that, despite being widely adopted, the RQS tool is not straightforward to comprehend and apply, and its results may not be reproducible in many cases (inter-rater reliability ICC 0.30–0.55, p ≤ 0.0015, and intra-rater reliability ICC as low as 0.522, p = 0.009, for the total RQS; inter-rater group k −0.12 to 0.75 and intra-rater group k −0.40 to 1 for item score reproducibility). Our results suggest that there is room for improvement in establishing an easy-to-use scoring framework for authors, reviewers, and editors to assess the quality of radiomics studies.

To date, the RQS has served as a valuable tool to fill the gap in guidance on the quality assessment of radiomics research. Like Lambin et al [4], we believe that the quality of radiomics research should not be compromised, and researchers should transparently report their methods to ensure quality and reproducibility. In addition, to further advance the field, researchers should be incentivized to adopt open science practices. Nonetheless, any questionnaire or score intended for the evaluation of research or clinical practice should be rigorously evaluated for its reliability and reproducibility. To date, this has not happened for the RQS even though it is widely used as a tool to assess the quality of radiomics research. Therefore, we believe that 5 years after its introduction, the RQS system should be updated to be more easily usable by researchers, reviewers, and editors. Recently, a new reporting guideline has been published that covers the requirements necessary to improve the quality and reliability of radiomics research [53]. We think our recommendations are also in line with this new guideline.

Interestingly, we found a slightly negative effect of the training session that took place prior to the RQS application (according to the two one-sided t-tests, groups 1 and 2 were not equivalent and were statistically different, with lower and upper equivalence bounds of ±2.6, a mean difference of −3.1, a lower-bound p value of 0.7, and an upper-bound p value < 0.001). The raters of group 1 showed poor inter-rater reliability despite the training, whereas group 2 showed moderate inter-rater reliability even though they had not received any instructions beforehand. Moreover, we observed a positive effect of greater experience only in the intra-rater reliability analysis: the advanced raters showed excellent intra-rater reliability, whereas the less experienced rater had moderate reliability. We did not observe an effect of experience in the inter-rater reliability analysis.

The raters indicated that the RQS instructions were not self-explanatory in most cases; therefore, they needed more time to interpret the RQS items and, consequently, to assign a score. For example, item 4, i.e., “imaging at multiple time points,” was one item with low inter-rater reproducibility (k = −0.1 in group 1; k = 0.54 in group 2) owing to its unclear definition in the checklist as well as in the article [4]. It could be argued that this item refers to imaging at different time points within the same examination, e.g., arterial/portal venous phase imaging, inspiration/expiration acquisitions, or test-retest scans. On the other hand, it could also be argued that it refers to longitudinal studies in which imaging is performed at different time points, e.g., within 3 months, to perform a delta radiomics analysis. Also, the non-standard range of values, i.e., the sudden change from +1 to +2 to −7 to +7, caused confusion among the raters when assigning a score, without a proper justification of such a non-standard range (e.g., for items 5, 12, and 16). A non-standard range would have been acceptable had the item scores been weighted according to their importance (Table 3).

Another problem was that some items that are unusual in the routine radiology workflow led to confusion rather than clarity. For example, some radiomics studies deal only with phantoms, with the intention of covering technical aspects or testing the stability of radiomics features [54, 55]. In this context, an item dealing with phantom studies (item 3) might be a good idea, but in practice, clinical radiomics studies do not necessarily use this phantom step to stabilize their features and thus do not fulfill this item. Although the transferability of feature robustness from a phantom to a specific biological tissue in the setting of radiomics still needs to be demonstrated, technically focused phantom studies typically lack clinical validation and therefore tend to achieve lower scores in the RQS system. Similar issues were identified with item 15, which addresses cost-effectiveness analysis. This is very unusual for current radiomics studies, which are mostly retrospective and rarely prospective, let alone part of a randomized controlled study. Also, the definition of cost for radiomics still represents a challenge and, to the best of our knowledge, no published cost-effectiveness analysis for radiomics exists in the literature [56]. Its value in terms of methodological quality could benefit from more research on the topic. Although items 3 and 15 were the most reproducible (Table 5), we argue that they create unnecessary clutter and have a limited impact on overall study quality, as they tended to be scored as absent almost by default, depending exclusively on the study aim or design.

Nowadays, more and more studies utilize deep learning for radiomics analysis; however, the current RQS tool mainly focuses on hand-crafted radiomics, and items specifically addressing the methodological challenges typical of deep learning approaches to radiomics are lacking. Consequently, robust and properly designed deep learning studies might be penalized with a low total RQS merely because they fail to address questions that are not relevant to deep learning methodology. Moreover, the current RQS tool does not rate sample size calculation or appropriate subject selection. We think that sample size analysis and the definition of study subjects should be included, since study design is one of the most critical steps of any study [57].

We noted that some studies included self-reported RQS scores in their publications; unfortunately, we found these to be overly enthusiastic assessments and observed a large discrepancy compared with the mean RQS results from our multireader analysis [51, 52]. It is not a new phenomenon that researchers tend to overestimate their results and report them within a rose-tinted frame of enthusiasm. This is a cautionary note for reviewers, editors, and readers to aid the correct evaluation of self-reported RQS scores based on our evidence.

Our study had some limitations. We included only a limited number of papers, but according to the guidelines, this is still more than the minimum sample size required for inter-rater reliability studies [16]. Moreover, we included articles only from European Radiology. However, in the field of medical imaging, European Radiology is the Q1 journal with the highest number of radiomics publications over the past 2 years, ensuring the quality of the studies across a selection of diverse radiomics research areas. In addition, although we intended to explore the effects of training in our study, we did not find any positive effect of training on the reproducibility of the RQS. On the one hand, using only one paper as a teaching example might not be sufficient to produce a measurable difference. On the other hand, a tool that requires extensive training to reach adequate reproducibility, even among researchers in the field, reveals the limitations of the RQS. Moreover, we did not investigate the effect of training on intra-rater reliability; however, we think this effect might be too small to detect, as we already found the intra-rater reliability to be moderate to excellent.

In conclusion, we have come a long way in the field of radiomics research, but on the long road to clinical implementation, we need reproducible scoring systems as much as we need reproducible radiomics research. We hope that our recommendations for a more straightforward radiomics quality assessment tool will help researchers, reviewers, and editors to achieve this goal.