The rapid growth of artificial intelligence (AI) applications in radiology spans many areas, from medical image interpretation to workflow management and decision-making. [1] These applications have the potential to improve diagnostic accuracy, transform radiology workflows, and enhance patient risk stratification. [2] Although the market may soon consolidate, commercial interest is still rising, and the list of healthcare AI companies continues to expand.

As radiology AI evolves toward full integration into the daily routine, ensuring its proper function is paramount. Hence, there is a growing need for a more critical appraisal of AI applied to patient care.

Trials involving control and intervention groups appear to date back to the beginning of recorded history. [3] James Lind’s 1753 publication of a controlled trial demonstrating the efficacy of citrus fruit in scurvy was a cornerstone in the acceptance of comparative trial methodology. [3] Randomized controlled trials (RCTs) remain one of the most powerful tools in research. [4] No other study design can so effectively balance unknown factors that may influence the clinical course. [4]

A review comparing the diagnostic accuracy of deep learning algorithms with that of healthcare professionals in classifying diseases from medical imaging found many methodological shortcomings: most studies lacked external validation and did not compare performance on the same samples, limiting the reliability of the reported diagnostic accuracy. [5]

As AI becomes an integral part of the radiology workflow, critical evaluation of AI with methodologically strong tools is crucial. We therefore reviewed RCTs assessing AI systems in radiology. Our review adhered to the PRISMA (Preferred Reporting Items for Systematic Reviews and Meta-Analyses) guidelines, and our search terms covered AI, radiology, and RCTs. We used the paper by Stolberg et al [4] regarding RCTs as guidance for inclusion. The PubMed search was conducted in September 2021.
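For illustration only, a search along these lines can be run programmatically against PubMed; the sketch below uses the Biopython Entrez interface, and the query string and date limits are hypothetical approximations rather than the exact search strategy used in this review.

```python
from Bio import Entrez  # Biopython interface to the NCBI E-utilities

Entrez.email = "reviewer@example.org"  # placeholder; NCBI requires a contact address

# Hypothetical query combining AI, radiology, and RCT terms (approximation only)
query = (
    '("artificial intelligence" OR "deep learning" OR "machine learning") '
    'AND (radiology OR "diagnostic imaging") '
    'AND "randomized controlled trial"[Publication Type]'
)

# Restrict to publications up to September 2021, as in the review
handle = Entrez.esearch(db="pubmed", term=query, retmax=300,
                        datetype="pdat", mindate="1800/01/01", maxdate="2021/09/30")
record = Entrez.read(handle)
print(record["Count"], "PubMed entries retrieved")
```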

Our search produced a total of 195 entries. Forty-nine results were unrelated to clinical practice, and 17 did not use AI. Eight results came from gastroenterology, 13 from ophthalmology, and 10 from cardiology. Twenty-three entries concerned clinical prediction models rather than imaging. Only 64 entries were related to radiology. Only one paper, “Artificial Intelligence Algorithm Improves Radiologist Performance in Skeletal Age Assessment: A Prospective Multicenter Randomized Controlled Trial,” published in Radiology, was a randomized controlled trial. [6] A PRISMA flow diagram of the literature review is presented in Fig. 1.

Fig. 1 PRISMA flow diagram

Hence, to date, there is a single RCT in the field of AI in radiology. In their trial, Eng et al [6] evaluated the effect of an AI diagnostic aid on skeletal age assessment on hand radiographs and showed that their AI-based algorithm improved both accuracy and interpretation times. [6]

Thus, despite the dramatic increase in deep learning publications in radiology, there is only one relevant RCT publication. The understanding that AI interventions require prospective evaluation to prove their impact on health outcomes led to the development of specialized guidelines such as CONSORT-AI (Consolidated Standards of Reporting Trials–Artificial Intelligence) and SPIRIT-AI (Standard Protocol Items: Recommendations for Interventional Trials–Artificial Intelligence). [7] A recent systematic review of RCTs of machine learning interventions in healthcare likewise found a scarcity of trials, identifying only 41, and not a single one adhered to all CONSORT-AI guidelines. [8]

Recognizing the hazardous potential of AI systems is of paramount importance. In contrast to other health interventions, AI systems can produce unpredictable and undetectable errors that are not explainable by human logic. For instance, minor changes in medical images that are invisible to the human eye may completely alter diagnostic results. Another well-known example is a decision support system that provided incorrect and sometimes even dangerous treatment recommendations. [9] RCTs can therefore play an important role in monitoring the safety of these systems.
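To illustrate how imperceptible image changes can flip a model’s output, the sketch below implements a standard fast gradient sign method (FGSM) perturbation. It is a generic example assuming a differentiable PyTorch classifier and placeholder `model`, `image`, and `label` objects; it is not the specific mechanism reported in [9].

```python
import torch
import torch.nn.functional as F

def fgsm_perturbation(model, image, label, epsilon=0.003):
    """Return an image with an imperceptible adversarial perturbation (FGSM).

    `model`, `image` (a 1 x C x H x W tensor), and `label` are placeholders
    for any differentiable classifier and input; epsilon bounds the change.
    """
    image = image.clone().detach().requires_grad_(True)
    loss = F.cross_entropy(model(image), label)
    loss.backward()
    # Step in the direction that maximally increases the loss; the per-pixel
    # change is bounded by epsilon and typically invisible to the human eye.
    return (image + epsilon * image.grad.sign()).clamp(0, 1).detach()
```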

In a recent review published in European Radiology, Kelly et al [10] discussed several critical methodological issues. The lack of explainability in 28% of clinical deep learning radiology papers is worrisome, as is an average 6% decrease in performance at external validation, with a drop of more than 10% in 78% of the studies. They also found problematic study designs, such as insufficient sample sizes and unspecified ground truth. A lack of performance comparison was found in 17% of the reviewed studies, and some studies used medically naïve readers as comparators. [10] These findings are concerning and call for an improvement in research quality. Using international data for external validation may be a first step, but RCTs are a stronger tool.

There are several explanations for the lack of RCTs in this field. First, some may believe that conventional tools for evaluating an AI system are sufficient: external validation, comparison of a model’s performance with that of a previous model, and metrics such as the area under the receiver operating characteristic curve (AUC) are often used to assess performance (see the sketch after this paragraph). Second, RCTs of continuously learning algorithms are complicated to design and interpret. Third, controlling multi-center variability is difficult. Fourth, designing a study for bundled AI tools, such as a post-processing algorithm with incorporated segmentation, is not a simple task because of the interaction between the algorithms. Lastly, setting up an RCT is often challenging and time-consuming compared with alternative study designs.
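A minimal sketch of such a conventional retrospective evaluation, computing the AUC on a held-out validation set with scikit-learn, is shown below; the labels and scores are placeholder values, not data from any cited study.

```python
from sklearn.metrics import roc_auc_score

# Placeholder ground-truth labels and model probabilities for a held-out
# (e.g., external) validation set; a real evaluation would use patient-level data.
y_true  = [0, 0, 1, 1, 0, 1, 1, 0]
y_score = [0.10, 0.35, 0.80, 0.65, 0.20, 0.90, 0.55, 0.40]

print("Area under the ROC curve:", roc_auc_score(y_true, y_score))
```

Such metrics summarize retrospective discrimination but, unlike an RCT, say nothing about the algorithm’s effect on downstream clinical decisions or patient outcomes.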

Nevertheless, RCTs remain the most powerful type of experimental study. [4] In light of the AI revolution in radiology, we believe the time has come for RCTs and encourage further research in this important field.