Introduction

A survey among European Society of Radiology (ESR) members indicated a promising role for artificial intelligence (AI) in radiology, with over 50% of respondents using or considering its use [1]. The promises surrounding AI have been monumental: it is claimed to enhance technical performance and the detection and quantification of pathologies, streamlining radiologists’ workflows and improving patient outcomes [2,3,4,5]. AI is expected to generate value not only in image acquisition, preprocessing, and interpretation across imaging modalities (CT, MRI, X-ray, and ultrasonography), but also in administrative radiology tasks that leverage generative AI.

This alleged value of radiology AI should, however, be rigorously assessed before implementation. The general lack of knowledge regarding radiology AI’s added value has been reported elsewhere [2, 6, 7] and calls for a robust valuation framework. Such valuation must move beyond conventional metrics like sensitivity and specificity, delving into added value at the clinical level by considering, among others, the influence on clinical decision-making, workflow implications, and the actual value for the patient [8,9,10,11,12].

In this paper, we present the Radiology AI Deployment and Assessment Rubric (RADAR), a framework designed to conceptualize the value of radiology AI across its entire lifecycle. Rooted within the widely endorsed concept of value-based radiology, RADAR emphasizes the centrality of patient outcomes [8, 13, 14]. Subsequently, we discuss various study designs that help to assess value in alignment with the distinct levels of RADAR.

Radiology AI Deployment and Assessment Rubric (RADAR)

The conceptual RADAR framework is depicted in Fig. 1. Table 1 provides a comprehensive definition of the RADAR levels and links them to the study designs discussed throughout this paper. RADAR is an adaptation of Fryback and Thornbury’s 1991 “Imaging Efficacy Framework” [10], designed to evaluate the efficacy of imaging technologies. It methodically progresses through seven hierarchical levels of efficacy, from specific to broad. Each efficacy level is foundational for the next: e.g., when technical efficacy (RADAR-1) is not ensured, progression to subsequent levels becomes moot. We introduce the novel level of “local efficacy” (RADAR-7), underscoring the need to value an AI system within its local context. This is crucial, as insights from RADAR-1–6 might not translate universally across different healthcare institutions.

Fig. 1 Overview of the RADAR framework. The outer circle depicts the RADAR efficacy level, and the inner circle provides its description. Abbreviations: AI, artificial intelligence; RADAR, Radiology AI Deployment and Assessment Rubric

Table 1 RADAR definition, illustration, and connection to the relevant study designs

We illustrate RADAR with the hypothetical case of a multifunctional AI system for stroke care. RADAR commences with technical efficacy (RADAR-1), constituting the prerequisite that the AI can consistently process and analyze CT brain images for subsequent tasks. Diagnostic accuracy efficacy (RADAR-2) is perhaps the most widely reported evidence type in the AI literature. In our stroke example, this could pertain to the sensitivity and specificity of the algorithm in finding and highlighting large vessel occlusions. Both RADAR-1 and RADAR-2 are foundational measures, addressed before clinical implementation. Adequate diagnostic accuracy (RADAR-2) could allow for an impact on diagnostic thinking (RADAR-3) if the radiologist’s diagnostic process changes due to AI usage (for instance, when AI use speeds up stroke diagnosis). An impact on the therapeutic process (RADAR-4) occurs when, e.g., accurate and fast stroke diagnosis results in more thrombectomies being performed. Efficacy at the first four levels culminates in actual patient outcomes (RADAR-5), which could in our example be measured as a reduction in long-term brain damage.

Whereas RADAR-1–5 are mostly clinically oriented, cost-effectiveness efficacy (RADAR-6) expands to incorporate wider considerations by contrasting costs with societal health benefits. Finally, the added level of local efficacy (RADAR-7) highlights the local adaptability and feasibility of the technology, for instance, the fit to the workflow of a specific hospital or stroke center.

RADAR-1 through RADAR-5: the assessment of clinical value

The first five RADAR levels predominantly pertain to clinical value, starting from technical efficacy (RADAR-1) and culminating in broad patient outcome efficacy (RADAR-5). The appropriate clinical valuation method should conform to the AI system’s objective, typically aligning with one of three primary aims: description, identification, or explanation [15].

Descriptive studies shed light on disease frequency without causal considerations [16] and are mostly irrelevant to radiology AI. Identification studies discern individuals with a disease (diagnostic) or those at risk of developing it (prognostic) [17]; we focus on the former, as it is most pertinent to radiology AI. In this light, we discuss the cross-sectional study and the in silico clinical trial (IST), which address RADAR-1 and RADAR-2. Finally, we turn to explanation-based studies, which explore causality and the mechanisms of the AI system’s impact. Against this background, we delineate the randomized controlled trial (RCT) and the observational cohort study, related to RADAR-3 through RADAR-5. All discussed study designs are summarized in Table 2.

Table 2 Overview of the study designs for the assessment of clinical value in radiology AI (RADAR-1 through RADAR-5)

Cross-sectional study

In the valuation of radiology AI, cross-sectional studies—single-point-in-time studies that assess a specific variable or outcome without requiring long-term follow-up—serve as a useful design [16]. They can assess whether the AI system is technically efficacious (RADAR-1), e.g., by testing the technology’s capability to accurately interpret radiographic images. Cross-sectional studies can also measure the technology’s efficacy in diagnosing patients with a certain condition (RADAR-2), for instance, in terms of sensitivity and specificity in identifying lung nodules on CT scans. Cross-sectional studies are relatively fast and cheap, as they require only a single interaction with the study population and no time-consuming follow-up.
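To make this concrete, the sketch below shows how sensitivity and specificity, with Wilson score confidence intervals, could be computed from the confusion-matrix counts of such a single-time-point study. It is a minimal Python illustration; all counts are hypothetical.

```python
import math

def wilson_ci(successes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score 95% confidence interval for a binomial proportion."""
    p = successes / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half = (z / denom) * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2))
    return center - half, center + half

# Hypothetical confusion-matrix counts from a cross-sectional study of an
# AI lung-nodule detector (illustrative numbers only).
tp, fn = 86, 14    # nodule present: detected / missed
tn, fp = 270, 30   # nodule absent: correctly cleared / false alarm

sens, (s_lo, s_hi) = tp / (tp + fn), wilson_ci(tp, tp + fn)
spec, (c_lo, c_hi) = tn / (tn + fp), wilson_ci(tn, tn + fp)
print(f"Sensitivity {sens:.2f} (95% CI {s_lo:.2f}-{s_hi:.2f})")
print(f"Specificity {spec:.2f} (95% CI {c_lo:.2f}-{c_hi:.2f})")
```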

Their design does not, however, afford a longitudinal perspective, limiting insight into the radiology AI’s performance over extended periods. Cross-sectional studies are therefore less equipped to address the AI’s influence on treatment decisions or patient outcomes (RADAR-3 through RADAR-5), as these often necessitate a longitudinal design (e.g., an RCT or cohort study).

In silico clinical trial

For radiology AI, there is commonly a large gap between retrospective proof-of-concept studies (RADAR-1) and a solution robustly evaluated in a clinical setting (RADAR-3 to RADAR-5). Retrospective studies in radiology AI generally focus on technical efficacy, while other aspects are equally crucial for trustworthy AI (e.g., fairness, usability, robustness) [18]. While RCTs are considered the gold standard for bridging this gap, conducting an RCT for every radiology AI system is time-intensive and costly.

To this end, virtual or in silico clinical trials (ISTs) have been proposed. ISTs assess the initial viability and potential of a technology, functioning as a preparatory step for RCTs [19,20,21]. The main difference is that digital data are used instead of human subjects. ISTs are therefore easier to organize, less expensive, and have a lower barrier to entry than RCTs. To ensure high levels of evidence before transitioning into RCTs, guidelines for ISTs are continuously evolving and becoming stricter so as to mimic RCTs as closely as possible.

In addressing technical efficacy (RADAR-1), ISTs might for instance be used to simulate the extent to which an AI technology can process X-ray images of bone fractures. Moreover, they can offer insight into diagnostic accuracy (RADAR-2), for instance, by modeling the technology’s proficiency in finding lung nodules in chest CTs. Furthermore, since ISTs can emulate various clinical situations, they could for example mimic the AI recommendation’s influence on the radiologist’s detection of irregularities in initial breast mammograms, addressing its influence on clinical management decisions (RADAR-3) before more comprehensive examination in a later-stage study. Pending further advancements, prospective ISTs could theoretically also address RADAR-4 and RADAR-5.
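As a highly simplified sketch of the IST idea, the following Python snippet runs a Monte Carlo comparison of unassisted versus AI-assisted reading on virtual cases; the prevalence and reader operating points are assumptions chosen purely for illustration.

```python
import random

def simulate_arm(n_cases: int, prevalence: float,
                 sens: float, spec: float) -> float:
    """Simulate one reading arm on virtual cases; return overall accuracy."""
    correct = 0
    for _ in range(n_cases):
        diseased = random.random() < prevalence
        if diseased:
            correct += random.random() < sens   # true positive
        else:
            correct += random.random() < spec   # true negative
    return correct / n_cases

random.seed(42)
# Assumed operating points for unassisted vs AI-assisted reading.
unassisted = simulate_arm(10_000, prevalence=0.15, sens=0.82, spec=0.90)
assisted = simulate_arm(10_000, prevalence=0.15, sens=0.90, spec=0.91)
print(f"Unassisted accuracy:  {unassisted:.3f}")
print(f"AI-assisted accuracy: {assisted:.3f}")
```

In a real IST, the virtual cases would of course come from validated virtual-patient generators or digital twins rather than independent random draws, as discussed below.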

The idea of ISTs for healthcare, already proposed in 2011, has only recently been applied to radiology AI, largely due to the challenges of digital data generation [22]. Since ISTs assume that results on digital data generalize to real patient data, the generation of representative and realistic digital data is crucial to their validity. Two prevailing approaches are virtual patient generation from compiled datasets and the use of digital twins mimicking individual patients [23, 24]. Technological advancements have eased data simulation and improved generalization to real patient data. Yet, each method requires specific developments and stringent quality control to ensure accurate representation.

Randomized controlled trial

RCTs are underrepresented in (radiology) AI [25, 26], which aligns with the absence of careful value assessment [26]. RCTs are widely recognized as the gold standard in evidence-based medicine and could strongly benefit radiology AI valuation. In terms of diagnostic thinking efficacy (RADAR-3), RCTs can investigate whether there is a shift in the radiologist’s diagnostic process and whether such changes yield measurable improvements. Regarding therapeutic efficacy (RADAR-4), RCTs can measure the effect of AI on treatment strategies, such as how AI assistance in interpreting images affects the final choice of treatment. In terms of patient outcomes (RADAR-5), they might finally evaluate direct patient outcomes, such as whether an AI-guided intervention results in improved survival rates.

To draw adequate causal conclusions, researchers must maximize internal validity. RCTs are highly regarded for their robust internal validity, which is maximized when selection bias, information bias, and confounding bias are mitigated. Selection bias (i.e., a relation between inclusion in the study and exposure assignment) is minimized by assigning exposure after individuals are included in the study, information bias through (double) blinding, and confounding bias through randomization, which ensures a balanced distribution of potential confounders across the exposed and unexposed arms. While RCTs boast strong internal validity, their external validity (or generalizability) can be a concern due to strict eligibility criteria, which limit applicability to populations outside the controlled setting. Improving external validity in RCTs is challenging and generally relies on replicating the study with a wider scope of patients (e.g., through a multicenter approach).
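For illustration, the sketch below computes the approximate per-arm sample size for a two-arm RCT comparing proportions of a binary patient outcome (RADAR-5), using the standard normal-approximation formula; the assumed effect size is hypothetical.

```python
import math
from statistics import NormalDist

def n_per_arm(p1: float, p2: float, alpha: float = 0.05, power: float = 0.80) -> int:
    """Approximate per-arm sample size for comparing two proportions."""
    z_a = NormalDist().inv_cdf(1 - alpha / 2)   # two-sided significance
    z_b = NormalDist().inv_cdf(power)           # desired power
    p_bar = (p1 + p2) / 2
    num = (z_a * math.sqrt(2 * p_bar * (1 - p_bar))
           + z_b * math.sqrt(p1 * (1 - p1) + p2 * (1 - p2))) ** 2
    return math.ceil(num / (p1 - p2) ** 2)

# Hypothetical effect: good functional outcome in 40% of controls vs 50%
# with AI-assisted stroke triage (illustrative numbers only).
print(n_per_arm(0.40, 0.50))  # -> 388 patients per arm
```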

Cohort studies

In a systematic review on AI in clinical radiology, 98% of clinical questions were approached with (retrospective) cohort studies, making them by far the most employed study design [27]. Cohort studies investigate associations between an intervention and outcomes over time, with participants compared by exposure status. Although cohort studies are fundamentally longitudinal, a single measurement instance can also facilitate a cross-sectional analysis, allowing both explanatory and identification-based research questions to be addressed.

While RCTs are often considered the gold standard, cohort studies provide a viable alternative. In contrast to the high costs and limited duration of RCTs, cohort studies can follow larger populations over extended periods, focusing on the long-term effects of AI on patient health outcomes (RADAR-5). Cohort studies often allow for a large study population, resulting in strong external validity.

In conducting a cohort study, one must, however, address potential threats to internal validity, including selection, information, and confounding biases. With careful design and analysis, these issues can be anticipated. Emergent analytic techniques, such as instrumental variables (like Mendelian randomization in genetics), g-methods (e.g., the g-formula and marginal structural models), and target-trial emulation, support accurate causal inference. For instance, target-trial emulation can simulate an RCT within a cohort study, offering insight into AI impacts without the need to repeatedly conduct expensive and time-consuming RCTs.
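As a minimal illustration of such confounding adjustment, the sketch below applies direct standardization over a single confounder, a simple special case of the g-formula; all stratum counts are hypothetical.

```python
# Hypothetical cohort, stratified by one confounder (disease severity).
# Exposure: AI-assisted reading; entries are (outcome events, n) per group.
strata = {
    "mild":   {"exposed": (10, 200), "unexposed": (12, 400)},
    "severe": {"exposed": (60, 300), "unexposed": (30, 100)},
}

total_n = sum(s["exposed"][1] + s["unexposed"][1] for s in strata.values())

# Standardize the stratum-specific risk differences to the full cohort.
adjusted_rd = 0.0
for s in strata.values():
    e_events, e_n = s["exposed"]
    u_events, u_n = s["unexposed"]
    weight = (e_n + u_n) / total_n           # stratum share of the cohort
    adjusted_rd += weight * (e_events / e_n - u_events / u_n)

print(f"Standardized risk difference: {adjusted_rd:+.3f}")  # -0.028
```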

RADAR-6 and RADAR-7: the assessment beyond clinical value

Health economic evaluations

Health economic evaluations (HEEs) are vital in understanding the (societal) financial feasibility of health technologies, yet are notably scarce in medical AI [28, 29]. HEEs contrast costs and health outcomes of two or more technologies, such as comparing an AI technology with the standard of care [30]. Costs encompass direct expenditures like purchasing, licensing, and training costs, as well as indirect costs such as productivity loss and informal care costs. Outcomes are typically patient (health) outcomes such as quality-adjusted life-years (QALYs) (RADAR-5), customarily obtained from RCTs, observational studies, modeling, or a combination thereof. HEEs can be leveraged to address cost-effectiveness efficacy (RADAR-6), moving beyond only clinical effectiveness [30].

Table 3 displays three common HEE methods. Cost-minimization analysis (CMA) is used when there is sufficient reason to believe that the AI system does not improve (clinical) outcomes but may reduce costs, e.g., by streamlining the diagnostic workflow. Cost-effectiveness analysis (CEA) may be used when the AI system has the potential to improve patients’ clinical outcomes, providing insight into the ratio of incremental costs to (improved) clinical outcomes, captured in the incremental cost-effectiveness ratio (ICER). Finally, cost-utility analysis (CUA) is similar to CEA, except that clinical outcomes are measured in QALYs, ensuring standardized comparisons of technologies across healthcare fields.
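A worked toy example of the ICER computation at the heart of a CEA/CUA is sketched below; all costs and QALY estimates are hypothetical placeholders.

```python
# Hypothetical cost-utility comparison: AI-assisted pathway vs standard
# of care, with mean per-patient cost (EUR) and QALYs from a model.
cost_ai, qaly_ai = 12_500.0, 6.15
cost_soc, qaly_soc = 11_000.0, 6.00

icer = (cost_ai - cost_soc) / (qaly_ai - qaly_soc)
print(f"ICER: {icer:,.0f} EUR per QALY gained")   # 10,000 EUR/QALY

# Judged against an assumed willingness-to-pay threshold of 20,000
# EUR/QALY, the AI pathway would count as cost-effective in this example.
threshold = 20_000
print("cost-effective" if icer <= threshold else "not cost-effective")
```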

Table 3 Overview of the health economic evaluation study designs for the assessment of cost-effectiveness efficacy in radiology AI (RADAR-6)

Budget impact analysis

Efficacy determined at the initial RADAR levels may not generalize to every local context, necessitating an evaluation of how well the value identified in RADAR-1 through RADAR-6 translates locally. For instance, local variation in workflow, population composition, and IT infrastructure can all affect the ultimate value of an AI technology and thereby its acquisition [31]. It is thus vital to tailor AI valuation to the specific features and requirements of the local healthcare setting, captured in RADAR-7.

To address local financial feasibility, budget impact analysis (BIA) evaluates the AI by considering local budgetary constraints and local population composition (Table 4) [32]. A comprehensive BIA accounts for locally estimated costs, including acquisition, maintenance, training, and workflow adaptation. This provides valuable insights into the AI’s affordability and sustainability for local radiological practices and aids in optimal decision-making during the acquisition phase.
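The sketch below illustrates the arithmetic of a simple BIA over a three-year horizon; every input is a hypothetical placeholder that would be replaced with locally estimated figures.

```python
# Minimal budget impact sketch for one hospital; all inputs hypothetical.
annual_exams = 8_000          # eligible exams per year at this hospital
horizon_years = 3

acquisition = 150_000         # one-off purchase and integration cost (EUR)
per_exam_fee = 2.50           # licensing fee per processed exam
annual_maintenance = 10_000   # support, retraining, quality control
saving_per_exam = 3.00        # assumed reading-time saving, monetized

for year in range(1, horizon_years + 1):
    costs = per_exam_fee * annual_exams + annual_maintenance
    if year == 1:
        costs += acquisition
    savings = saving_per_exam * annual_exams
    print(f"Year {year}: net budget impact = {costs - savings:+,.0f} EUR")
```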

Table 4 Overview of the study designs for the assessment of local efficacy in radiology AI (RADAR-7)

When performing a BIA, it is crucial to consider not only the financial implications of implementing a new technology, but also who shoulders the costs and who reaps the benefits. While an AI tool may boost one department’s efficiency, its funding could come from another department. For instance, an AI tool for early stroke detection may be funded by the radiology department, while the neurology department benefits most from the improved diagnostic capabilities through greater efficiency and better patient outcomes, without any increase in its own expenditures. This can result in disagreements between departments over funding responsibility. Understanding these budget dynamics is therefore essential when assessing AI value and increasing adoption, as a BIA concerns not only total costs but also how they are distributed.

Multi-criteria decision analysis

Whereas the previously discussed methods mostly focus on clinical outcomes and cost-effectiveness, a broader approach to the valuation of local efficacy (RADAR-7) allows for including not only medical and economic considerations, but also legal, social, and ethical ones, the last being particularly relevant in radiological AI [11, 14, 33,34,35]. Examples of such broader issues are usability (how easy the AI technology is to use), regulation (how well it conforms with local regulatory guidelines), and explainability (to what extent the radiologist can understand the AI system’s decisions).

While crucial in valuing radiology AI, these issues are difficult to operationalize and quantify through the previously discussed methods. Multi-criteria decision analysis (MCDA) offers a solution by facilitating a comparison of a highly diverse range of issues [36]. An MCDA of an AI tool would involve local stakeholders to (1) identify the key criteria (e.g., patient outcomes, cost-effectiveness, ethical and social concerns); (2) score the AI technology against these criteria; and (3) calculate an aggregate score to inform decision-making and acquisition. This allows for a broad health technology assessment perspective and ensures alignment with local requirements and constraints, effectively addressing local efficacy (RADAR-7).
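The aggregation step (3) is often a simple weighted sum, as sketched below; the criteria, weights, and scores shown are hypothetical and would in practice be elicited from the local stakeholders.

```python
# Sketch of a weighted-sum MCDA aggregation (hypothetical inputs).
weights = {                    # criterion weights; must sum to 1
    "patient_outcomes": 0.35,
    "cost_effectiveness": 0.25,
    "usability": 0.15,
    "explainability": 0.15,
    "regulatory_fit": 0.10,
}
scores = {                     # stakeholder scores on a 0-10 scale
    "patient_outcomes": 7,
    "cost_effectiveness": 6,
    "usability": 8,
    "explainability": 5,
    "regulatory_fit": 9,
}

assert abs(sum(weights.values()) - 1.0) < 1e-9
aggregate = sum(weights[c] * scores[c] for c in weights)
print(f"Aggregate MCDA score: {aggregate:.2f} / 10")   # 6.80
```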

Prospective monitoring

Prospective monitoring is vital to maintaining long-term relevance and efficacy at the local level (RADAR-7). Earlier work advocated a structured three-phase approach for successful local AI integration [37, 38]. Initially, the AI operates in “shadow mode,” allowing for safety assessments without affecting clinical decisions. This is followed by a small-scale workflow test, gathering valuable feedback from the involved clinicians. In the final stage, the AI becomes fully operational, necessitating ongoing monitoring. This continuous oversight helps counter challenges such as “model drift” [39, 40], where variations in new data inputs compromise AI performance. Given the comprehensive yet time-consuming nature of RADAR, and especially of lengthy study designs like RCTs, model drift could erode a study’s relevance by its conclusion. Meticulous planning and post-implementation prospective monitoring are therefore essential.
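As an illustration of such ongoing oversight, the sketch below tracks a rolling agreement rate between AI output and the radiologist’s verified read, flagging potential model drift when it falls below a preset tolerance; the thresholds and simulated data stream are hypothetical.

```python
from collections import deque
import random

WINDOW, TOLERANCE = 200, 0.85     # rolling window size; alert threshold
window: deque[bool] = deque(maxlen=WINDOW)

random.seed(0)
for i in range(1_000):            # simulated stream of verified cases
    p_correct = 0.93 - 0.00015 * i        # slowly degrading performance
    window.append(random.random() < p_correct)
    if len(window) == WINDOW and i % WINDOW == 0:
        agreement = sum(window) / WINDOW
        status = "OK" if agreement >= TOLERANCE else "DRIFT ALERT"
        print(f"case {i}: rolling agreement {agreement:.2f} [{status}]")
```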

Discussion

The RADAR framework is positioned to progressively value radiology AI through seven hierarchical efficacy levels and has been adapted from Fryback and Thornbury’s (1991) imaging efficacy framework [10]. We have expanded this original framework by tailoring it to radiology AI, adding a local efficacy level and connecting the levels with various study methodologies. We thereby provide radiologists and researchers with a framework that helps to conceptualize the valuation of radiology AI throughout the entire lifecycle. Local decision-makers can moreover use RADAR in making well-founded, evidence-based decisions in the acquisition of radiology AI.

While we predominantly showcased RADAR through examples focused on improving patient health outcomes, it is important to note that many AI systems target non-clinical tasks, e.g., the automation of routine administrative work with large language model technologies. RADAR is also positioned to address such AI systems. While in these examples cost savings (RADAR-6) are likely to be most relevant, influence on the other RADAR levels is not excluded. For instance, a reduced administrative load could indirectly influence diagnostic thinking (RADAR-3) by granting radiologists more time for precise diagnoses, which could progressively influence the higher RADAR levels. Radiologists and decision-makers should therefore hierarchically progress through all RADAR levels when ascertaining value, although this process is likely to be faster for technologies focused on administrative tasks such as those described above.

Several frameworks have previously been suggested for valuing radiology AI. The international FUTURE-AI consortium has formulated broad principles with an accompanying checklist to guide developers towards creating safe and trustworthy radiology AI [18]. The Canadian Association of Radiologists [41] and Park et al. [42] proposed guidance on addressing technical performance (RADAR-2). Omoumi et al. offered a more comprehensive checklist, assessing the value of radiology AI technologies through a wider array of concerns [43].

RADAR is unique in that it accounts for different valuation needs throughout the radiology AI lifecycle. For instance, early proof-of-concept technologies would mostly require technical efficacy (RADAR-1) and diagnostic accuracy efficacy (RADAR-2) assessments to confirm their foundational capabilities. In contrast, more mature technologies, for which cost-effectiveness has been demonstrated (RADAR-6), require a local value assessment or a prospective monitoring plan (RADAR-7) to ensure their broader utility translates locally. RADAR is thus contingent on the state and valuation need of the specific technology, which is vital as these change throughout the radiology AI lifecycle.

In conclusion, RADAR constitutes a conceptual framework for the valuation of radiology AI throughout its lifecycle. It initiates with technical performance at the technology’s conception (RADAR-1) and incorporates increasingly broader valuation, ultimately resulting in the assessment of generalizability to the local context (RADAR-7). Progressing hierarchically through the seven levels, RADAR constitutes a comprehensive valuation framework, positioned to bridge the implementation gap in radiology AI.