Introduction

The “Artificial Intelligence (AI) winter” [1] is over. The application of AI and radiomics to medical images for the non-invasive characterisation of diseases (i.e., image mining) has increased remarkably in recent years. The first reports on AI and radiomics applied to medical images date back to 1963 [2] and 1973 [3], respectively, but the enthusiasm of those years faded quite soon. Recently, the increasing amount of electronic medical data, technological improvements, and health sustainability issues have led to renewed interest in both AI and radiomics applications.

Image mining is claimed to have potentially huge clinical relevance, offering the possibility to non-invasively diagnose, characterise, and predict outcome in almost all medical conditions. However, despite the number of published studies, issues including the significance, goodness, and strength of the reported results still need to be addressed. In particular, it is not clear how far image mining is from clinical practice.

Therefore, the aim of this systematic review was to analyse the literature on AI and radiomics, across all medical imaging modalities and for both oncological and non-oncological applications, in order to assess how far image mining research stands from routine medical application. To do this, we applied a trial phases classification inspired by the drug development process.

Material and methods

This systematic review was performed according to the PRISMA statement [4]. The PRISMA checklist is provided in Supplemental material.

Search, eligibility criteria and study selection

The endpoint of the analysis was to assess the potential of AI applied to medical images and of radiomics (i.e., image mining) to be implemented in clinical practice. Our search algorithm within the PubMed/MEDLINE database consisted of the combination of the following terms: “Artificial intelligence[Mesh]” OR “Radiomic” OR “Radiomics” AND/OR “Radiography[Mesh]” OR “Ultrasonography[Mesh]” OR “Tomography[Mesh]” OR “Image interpretation, computer-assisted[Mesh]” OR “Multimodal imaging[Mesh]” OR “Tomography, emission-computed[Mesh]” OR “Photography[Mesh]”. No start date limit was used, and the search was extended until September 5th, 2018.
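As a minimal sketch, the search could be reproduced programmatically; the snippet below assumes Biopython’s Entrez client, a placeholder e-mail address, and one plausible Boolean grouping of the terms listed above (the original strategy reads “AND/OR”).

```python
# Sketch of how the PubMed search might be reproduced; the e-mail
# address and the Boolean grouping of terms are assumptions.
from Bio import Entrez

Entrez.email = "reviewer@example.org"  # required by the NCBI E-utilities

query = (
    '("Artificial intelligence"[Mesh] OR Radiomic OR Radiomics) AND '
    '("Radiography"[Mesh] OR "Ultrasonography"[Mesh] OR "Tomography"[Mesh] '
    'OR "Image interpretation, computer-assisted"[Mesh] '
    'OR "Multimodal imaging"[Mesh] OR "Tomography, emission-computed"[Mesh] '
    'OR "Photography"[Mesh])'
)

# No start date limit; search extended until September 5th, 2018
handle = Entrez.esearch(db="pubmed", term=query, datetype="pdat",
                        mindate="1900/01/01", maxdate="2018/09/05", retmax=0)
record = Entrez.read(handle)
print("Records retrieved:", record["Count"])
```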

According to the scope of the review, we considered AI (i.e., algorithms that take images as inputs) and radiomics investigations aimed at objectives relevant to clinical practice: biological characterisation, risk stratification, treatment response prediction, toxicity prediction, and prognostication of a given disease. The imaging modalities considered were ultrasound, radiography, mammography, endoscopy, skin pictures, ocular fundus pictures, computed tomography (CT), magnetic resonance imaging (MRI), scintigraphy (either planar images, SPECT, or SPECT/CT), and positron emission tomography (PET) or PET/CT. Subsequently, we applied the following exclusion criteria: (a) articles not in the English language; (b) studies not within the field of interest; (c) guidelines, review articles and meta-analyses, editorials or letters, comments, and conference proceedings; (d) “in vitro”, phantom, or animal studies; (e) case reports or small case series (≤ 10 patients); (f) studies involving healthy subjects; (g) research articles focused on methodological aspects (algorithm and/or software development and/or comparison; evaluation/comparison of methods for parameter optimisation, segmentation, and feature extraction; test–retest studies); (h) studies testing data other than medical images as input for AI algorithm(s); (i) radiomics studies evaluating descriptors of shape and size or image intensity histogram only (i.e., no textural features); (j) lack of validation in a clinical setting; (k) lack of conventional metrics (i.e., sensitivity, specificity, accuracy, and/or hazard ratio, and/or recall, and/or AUC, and/or C-index) for the report of validation results.

Two reviewers (MK and MS) independently performed an initial screening of the identified titles and abstracts, applying the inclusion/exclusion criteria. Discrepancies were resolved by a third reviewer (LA); the decision rule for consensus was simple majority. The reviewers then retrieved the full-text reports of the selected abstracts and subsequently performed an independent second-step selection.

Quality assessment of the literature

Quality assessment of the selected papers was performed according to the QUADAS-2 criteria, assessing 4 domains: (1) patient selection, (2) index test, (3) reference standard, and (4) flow and timing [5]. The signalling questions for each QUADAS-2 domain were tailored to the aim of this review, as detailed in Table 1. This evaluation assigned a risk of bias to each study, ranked as low (score = 2), high (score = 1), or indeterminate (score = 0) for each domain. We calculated the overall QUADAS-2 score as the sum of the four domain scores. The appropriateness of the statistical analysis was defined considering two aspects. First, the total number of patients analysed was considered appropriate if a radiomics study included at least five patients per feature (after feature selection, if performed), while AI studies were considered of acceptable quality if they included more than 50 patients. The sample size criterion for radiomics studies was adapted from the conventional rule for multiple regression: the number of data points (i.e., observations or cases) should be at least 5–10 times the number of variables [6]. At least ten patients per feature have been recommended for radiomics studies [7, 8]. The sample size criterion for AI studies was established assuming that at least 50 patients are needed to train and validate an algorithm while minimising the effects of overfitting and improving the quality of performance metrics, similarly to what is recommended for biomarker discovery [9]. Second, we assessed the balance in the number of patients between subgroups (e.g., the number of patients with benign vs malignant lesions in a study aimed at differential diagnosis); an imbalance of more than 2/3 was considered inappropriate.
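The scoring and appropriateness rules above can be made concrete in a short sketch (the function names are ours, and reading the 2/3 imbalance rule as “the larger subgroup exceeding two thirds of the total” is an assumption):

```python
# Sketch of the quality scoring and sample-size checks described above.
# Domain scores: low risk = 2, high = 1, indeterminate = 0 (four domains).

def overall_quadas2(domain_scores):
    """Sum of the four QUADAS-2 domain scores (range 0-8)."""
    assert len(domain_scores) == 4
    return sum(domain_scores)

def sample_size_appropriate(study_type, n_patients, n_features=None):
    """Radiomics: >= 5 patients per selected feature; AI: > 50 patients."""
    if study_type == "radiomics":
        return n_patients >= 5 * n_features
    return n_patients > 50  # AI study

def balance_appropriate(n_group_a, n_group_b):
    """Assumed reading: larger subgroup must not exceed 2/3 of the total."""
    larger = max(n_group_a, n_group_b)
    return larger / (n_group_a + n_group_b) <= 2 / 3

# Example: a radiomics study with 60 patients, 10 selected features,
# domain scores (2, 1, 2, 2), and a 45 vs 15 subgroup split
print(overall_quadas2([2, 1, 2, 2]))                 # 7 -> high quality (>= 7)
print(sample_size_appropriate("radiomics", 60, 10))  # True (60 >= 5 * 10)
print(balance_appropriate(45, 15))                   # False (45/60 > 2/3)
```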

Table 1 Description of the QUADAS-2 criteria used for the qualitative assessment

Phases classification criteria

We developed the phases classification criteria for image mining studies, inspired by the classification applied to clinical trials (Fig. 1). The parameters for phase categorisation included sample size, type of study (retrospective/prospective), type of validation approach (internal/independent), and development stage (pre-/post-marketing). Figure 1 reports the classification criteria in detail. We assigned each selected article to a phase from 0 to IV.
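A purely illustrative sketch of such a phase assignment follows; the operative criteria are those in Fig. 1, and all thresholds and rule orderings below are placeholder assumptions (e.g., the 100-sample threshold separating phase II from phase III, discussed later):

```python
# Illustrative sketch only: the operative criteria are those in Fig. 1.
# Every rule below is a placeholder assumption for demonstration.

def assign_phase(n_patients, prospective, independent_validation,
                 post_marketing):
    if post_marketing:
        return "IV"   # post-marketing stage
    if prospective and independent_validation and n_patients >= 100:
        return "III"  # prospective, independently validated, large cohort
    if independent_validation and n_patients >= 100:
        return "II"   # retrospective but large and independently validated
    if n_patients >= 100:
        return "I"
    return "0"        # small, retrospective, internal validation only

print(assign_phase(150, prospective=True, independent_validation=True,
                   post_marketing=False))  # "III"
```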

Fig. 1
figure 1

Trial phases. Trials classification for the drug development process (a) and for the proposed image mining tools development process (b). PK pharmacokinetics, PD pharmacodynamics

Statistical analysis

A database was created for the qualitative synthesis of the papers and the studies’ results. We recorded the results obtained in the validation set only. In case of multiple aims within the same article, the primary one was considered. If image mining was applied to different imaging modalities within the same article (e.g., PET and CT), results were recorded for each one. If different approaches were tested within the same paper (e.g., models derived from random forest and support vector machine), the best in terms of diagnostic performance was analysed. Articles with almost the same authorship reporting identical or very similar sample sizes, research hypotheses, methodologies, and results (“salami” publishing [10]) were excluded to avoid overlap in study populations and redundancy of results. Accordingly, in a series of articles considered as “salami”, the one with the largest sample size was evaluated and the other(s) excluded. Papers with a QUADAS-2 score ≥ 7 were included in the quantitative analysis. Descriptive statistical measures were used to summarise the data. Excel® 2017 (Microsoft®, Redmond, WA) was used for analysis.
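As a sketch of these curation rules, assuming pandas (the file and column names are hypothetical):

```python
# Hypothetical sketch of the database curation described above;
# the CSV file and all column names are placeholders.
import pandas as pd

df = pd.read_csv("image_mining_studies.csv")

# Keep validation-set results only, then the best-performing approach
# per article and imaging modality (e.g., random forest vs SVM)
validated = df[df["result_set"] == "validation"]
best = (validated.sort_values("auc", ascending=False)
                 .groupby(["article_id", "modality"], as_index=False)
                 .first())

# Quantitative analysis restricted to high-quality papers
high_quality = best[best["quadas2_score"] >= 7]
print(high_quality.describe())
```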

Results

Search, eligibility criteria and study selection

Overall, 34,626 articles were retrieved using the search algorithm. Subsequently, 33,997 papers were excluded by reviewing titles and abstracts and applying the inclusion/exclusion criteria. Five papers suspected of being “salami” publications were excluded. Figure 2 summarises the selection process. Table S1 reports a qualitative summary of the 300 selected articles, including the 171 high-quality ones. In recent years, a striking increase in the number of papers published on image mining has occurred: in 2017 and 2018 alone, 66 (22%) and 131 (44%) articles, respectively, were published. Figure 3 shows the literature trend. The vast majority (more than 80%) come from the oncology field. However, more than 50% of the studies included in the qualitative analysis were assigned a high risk of bias (Fig. 4). Accordingly, a substantial proportion of studies, 129/300 (43%), were scored as having a considerable risk of bias (QUADAS-2 ≤ 6), mainly in the “index test” and “patient selection” domains (Fig. 5). Table S2 reports a qualitative summary of the 171 high-quality (QUADAS-2 ≥ 7) papers. The temporal trend of the literature according to study phase is shown in Fig. 6. Figures 7, 8, and 9 present graphical syntheses of the high-quality articles (QUADAS-2 ≥ 7) from three different points of view: the clinician’s, the imager’s, and the researcher’s. The quantitative synthesis is summarised in Table 2. The main results of the phase III studies are reported in Table 3.

Fig. 2
figure 2

Study selection workflow

Fig. 3
figure 3

Trend of the published studies on artificial intelligence (AI), radiomics and the combined approaches radiomics/AI

Fig. 4
figure 4

Trend of literature on image mining according to QUADAS-2 score, considering 300 selected studies

Fig. 5
figure 5

QUADAS-2 assessment results. Distribution of the articles tabulated by the four QUADAS-2 domains for the 300 studies selected applying the inclusion/exclusion criteria (a) and for the 171 studies scored ≥7 (b)

Fig. 6
figure 6

Trend of literature on image mining according to trial phases classification, considering 300 selected studies (a) and the 171 high-quality studies (b)

Fig. 7
figure 7

Radiomics and artificial intelligence literature summary by disease and clinical setting

Fig. 8
figure 8

Radiomics and artificial intelligence literature summary by image mining approach and imaging modality

Fig. 9
figure 9

Radiomics and artificial intelligence literature summary by image mining approach and phase classification

Table 2 Quantitative synthesis of the 171 selected articles
Table 3 Summary of the results of the phase III trials on image mining (n = 8)

Discussion

The present systematic review is the first to assess the potential for implementation of image mining tools in clinical practice by means of a classification of the literature into development phases. Despite the amount of literature on image mining including a validation analysis, more than 90% of studies were classified as phase 0, I, or II (i.e., retrospective). Collectively, their results were uncertain in terms of significance, goodness, and strength, and their generalisability was weak. Even among the studies with a QUADAS-2 score ≥ 7, only 4.6% were categorised as phase III studies. As emerges from the present systematic review, the results are promising but not yet mature enough for the clinical implementation and widespread use of image mining tools. Nonetheless, study quality has increased in recent years.

Because of the paucity of phase III and IV studies, we did not proceed to a meta-analysis. Therefore, no definitive conclusion can be drawn on whether radiomics or AI should be preferred. AI techniques, in particular convolutional neural networks, have the advantage over radiomics of not requiring tumour segmentation, feature calculation, and feature selection; these steps are even more critical for tiny lesions submitted to radiomics processing. On the other hand, vast cohorts, which require considerable effort to collect and analyse, are crucial for robust AI-based model development. Also, an unbiased reference standard, not always easy to obtain, should be chosen to ensure AI model reliability. The combined radiomics/AI strategy is at an early stage [179,180,181,182,183], and the complementary roles of radiomics and AI techniques should be addressed [184]. Which image mining approach is best is still an open question.

Similar to the other “omics” domains (e.g., proteomics), few image mining biomarkers have reached clinical practice [9]. The translation of image mining research into the clinical arena is limited by the huge variability of the methods used for image analysis, together with the failure to reproduce results when tested in different cohorts of patients. Validation is a critical issue. Theoretically, the validation analysis of a successful model should provide performance measures consistent with those obtained in the training process. Thereafter, results obtained in the validation cohort should be confirmed by an independent test validation. Finally, the proposed approach should prove effective for its indication within the “real world” population of patients the model has been developed for. The validation process may be internal (e.g., cross-validation and bootstrapping) or external (using data not used for training). Typically, internal validation, used for a preliminary evaluation or for fine-tuning the model under development, overestimates performance [185]. In fact, the same cohort is used twice, once to select the feature subset and again to build the classification model, resulting in overfitting of the algorithm to the data [9]. External validation may be performed using three different strategies: (i) temporal (i.e., data obtained from newly recruited patients), (ii) geographic (i.e., data collected at a different institution), and (iii) split-sample (i.e., data split from the entire dataset and kept untouched for testing). External validation is crucial to verify the generalisability of the results [185]; random patient selection is an essential prerequisite, as is balance in patient characteristics. Temporal or geographic validation should be preferred to split-sample validation. In particular, geographic validation, which accounts for technical variability (scanners, acquisition parameters, and protocols) [185], is expected to be the most representative of the clinical setting.
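The gap between internal and external validation can be illustrated with a minimal sketch, assuming scikit-learn and synthetic data standing in for a development cohort and an external centre; in practice, external cohorts differ in acquisition and case mix, and performance typically drops:

```python
# Sketch contrasting internal cross-validation with external validation.
# Synthetic data only; the second dataset stands in for another centre.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=200, n_features=30, random_state=0)
X_ext, y_ext = make_classification(n_samples=100, n_features=30,
                                   random_state=1)  # "external" cohort

model = RandomForestClassifier(random_state=0)

# Internal validation: cross-validation on the development cohort
internal_auc = cross_val_score(model, X, y, cv=5, scoring="roc_auc").mean()

# External validation: data never used for model development
model.fit(X, y)
external_auc = roc_auc_score(y_ext, model.predict_proba(X_ext)[:, 1])

print(f"internal CV AUC: {internal_auc:.2f}, external AUC: {external_auc:.2f}")
```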

We excluded a priori from the present analysis studies testing shape and size as well as histogram-based features only, since our aim was to assess the “maturity” of advanced image analysis at its full potential, entailing textural indices derived from second-order and higher-order features. Image analysis based on the gray level histogram alone does not provide any information about the relative position of pixels/voxels within the region of interest. Therefore, such features cannot describe whether low/high gray levels are clustered together or interspersed among gray levels of different value [186].
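A minimal sketch of this point, assuming scikit-image: the two toy images below have identical gray-level histograms, yet their second-order (gray-level co-occurrence matrix, GLCM) contrast differs because the spatial arrangement of gray levels differs.

```python
# Two tiny "images" with identical histograms (eight 0s and eight 3s each)
# but different spatial arrangements; only texture features tell them apart.
import numpy as np
from skimage.feature import graycomatrix, graycoprops

clustered = np.array([[0, 0, 3, 3]] * 4, dtype=np.uint8)  # levels grouped
dispersed = np.array([[0, 3, 0, 3]] * 4, dtype=np.uint8)  # levels alternating

for name, img in [("clustered", clustered), ("dispersed", dispersed)]:
    glcm = graycomatrix(img, distances=[1], angles=[0], levels=4,
                        symmetric=True, normed=True)
    # GLCM contrast differs although first-order statistics are identical
    print(name, "GLCM contrast:", graycoprops(glcm, "contrast")[0, 0])
```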

In order to develop a valid and trustworthy image mining tool, the study cohorts (training, validation, and test) should be representative of the target population. This means that the sample size should be large enough to minimise the effects of overfitting, should include the “outliers”, and, consequently, should be reliable when the tool is used to assess unseen patients. We proposed that at least 50 patients should be included in AI studies, as also suggested by simulated analyses [9, 187, 188]. However, especially for deep learning approaches and complex tasks, much larger populations are needed. The effect of sample size on model performance has already been demonstrated: when a limited dataset (1,000 samples) rather than the complete dataset (>100,000 samples) was used to develop a retinopathy classifier, the weighted error was 12.7% versus 6.6%, respectively [33]. Moreover, sample size was not the only criterion used to score papers. Accordingly, even when the number of patients was relatively low, only well-designed studies were scored as having a low risk of bias and included in the quantitative analysis.
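The sample-size effect can be sketched with a learning curve on synthetic data, assuming scikit-learn; this merely mirrors the principle behind the retinopathy example, not its data:

```python
# Sketch of the sample-size effect: validation performance improves
# as the training set grows. Synthetic data for illustration only.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import learning_curve

X, y = make_classification(n_samples=5000, n_features=50, n_informative=10,
                           random_state=0)
sizes, train_scores, val_scores = learning_curve(
    LogisticRegression(max_iter=1000), X, y,
    train_sizes=[100, 500, 2000, 4000], cv=5, scoring="roc_auc")

for n, v in zip(sizes, val_scores.mean(axis=1)):
    print(f"n={n:5d}  validation AUC={v:.3f}")  # AUC rises with sample size
```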

In the present work we arbitrarily chose 100 samples as the threshold for trial phase categorisation (II vs III and above). This was a conservative decision: applied to phase classification, the conventional rule requiring at least ten patients per tested variable would have led to the downgrading of most studies, and considering that a feature reduction strategy was generally in place, we chose 100 as a reasonable cut-off. When planning an image mining study, both statistical recommendations (sample size and prospective design) and the variability of clinical conditions should be considered, so that the algorithm is developed on a dataset that realistically represents the target patient population. A sample size calculation was reported in only 3 of the 171 papers [44, 88, 125]. The use of multiple images taken from the same patient should be limited, since such images are prone to be similar (or almost identical), with a negative impact on generalisability. In fact, even though this practice increases the sample size, the multiplied observations (or cases) do not represent inter-patient variability and lead to overestimation of model performance. Accordingly, data augmentation should be used properly to avoid overfitting, keeping in mind that it cannot completely overcome the requirement of an adequate sample size.
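A minimal sketch of patient-level splitting, assuming scikit-learn: grouping by a (hypothetical) patient identifier guarantees that images from the same patient never straddle training and validation folds.

```python
# Patient-level splitting: multiple images from one patient must fall
# entirely on one side of the split, or performance will be overestimated.
import numpy as np
from sklearn.model_selection import GroupKFold

n_images = 12
X = np.random.rand(n_images, 8)           # one feature vector per image
y = np.random.randint(0, 2, n_images)     # labels
patient_id = np.repeat([0, 1, 2, 3], 3)   # three images per patient

for train_idx, val_idx in GroupKFold(n_splits=4).split(X, y, groups=patient_id):
    # Every patient's images land entirely in one side of the split
    assert set(patient_id[train_idx]).isdisjoint(patient_id[val_idx])
```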

To assess the appropriateness of the statistical analysis, we considered not only the sample size but also the number of patients within subgroups (an imbalance of more than 2/3 was considered inappropriate). Imbalanced cohorts in image mining studies may lead to constitutively biased results, which carry higher uncertainty and poor generalisability [189]. Consequently, in the developmental phase, a proper study design and analysis strategy using stratification, matching, weighting, covariate adjustment, or regression should be adopted. With such measures, imbalance per se does not prevent the use of radiomics and AI-based approaches, even when the prevalence of a disease or an outcome is very low [190].
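Two of these counter-measures, stratification and class weighting, can be sketched as follows, assuming scikit-learn and a synthetic 9:1 imbalance:

```python
# Sketch of imbalance counter-measures: stratified splitting preserves
# class proportions across folds, and class weighting penalises errors
# on the minority class more heavily. Synthetic data only.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

# 9:1 imbalance, e.g. benign vs malignant lesions
X, y = make_classification(n_samples=500, weights=[0.9, 0.1], random_state=0)

model = LogisticRegression(class_weight="balanced", max_iter=1000)
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
print(cross_val_score(model, X, y, cv=cv, scoring="roc_auc").mean())
```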

The applicability of the image mining framework to rare diseases remains an issue because of the limited data available for model development. Transversal platforms for the sharing and analysis of images and data, as envisaged by some research groups [191], could represent a valuable strategy for investigations in this field.

Presently, it is unrealistic to justify a medical decision by the output of a neural network or a radiomics feature/signature. Little is yet known about the biological significance of image-derived parameters. Correlations with tumour grading [39, 45, 46, 94, 100, 117, 118, 138, 174], inflammatory infiltrate [131], and gene expression, mutations, and molecular pathways [25, 44, 47,48,49,50,51,52,53,54,55,56,57,58, 85,86,87, 101, 121, 122, 139,140,141,142, 180] have been reported. Nonetheless, more should be learned about how AI and radiomics approaches work in order to solve the “black box” problem and to understand the underlying clinical and/or molecular connotations. Imagers should be able to assess the reliability of image mining approaches and to manage the patient independently (just as a pilot manages the plane and passengers during a flight). This innovative attitude, which implies the acquisition of technical and informatics skills, will help to remove the “black box” uncertainties and to move image mining towards clinical practice.

Additionally, some technical barriers must be faced when considering the implementation of image mining tools in everyday practice. These include a time-consuming workflow; uncertain reproducibility of results across different scanners, acquisition protocols, and image-processing approaches; regulatory issues concerning privacy and ethics; and data protection. Joint efforts are needed to accelerate research on these aspects, to implement the technological infrastructure, and to make the tools easy to use. Privacy and ethics regulations may restrict data and image sharing for the purposes of research and everyday clinical practice, and a shared strategy needs to be built for the management of these aspects. These challenges are an opportunity to develop a reliable methodology providing controlled data collection and a secure infrastructure, instead of gathering datasets of uncertain quality.

The need to provide reliable results has generated multiple initiatives and recommendations aimed at methodological standardisation and reproducibility [7, 185, 191,192,193]. Growing awareness among researchers of the urgency of improving the quality of investigations has determined an increase in the number of phase III trials over the last 2 years. In the era of evidence-based medicine, rigorous research with strict rules is the only way to achieve clinical acceptance and become part of the “standard of care”. The research process should aim to address a clinical need through an adequate statistical strategy, prospective and multicentre studies, robust reference standards, and adequate timing. Independent validation is mandatory, together with clarification of the impact of technical aspects on image mining models. Under these conditions, reproducible, strong, and, hopefully, excellent results will be achieved through an adequate research process. In this respect, closer collaboration should be established among clinical researchers, algorithm developers, and data scientists.

We foresee the transfer of lessons learned from the well-known drug development process, with some specific adaptations, to the image mining discipline as the most effective way for radiomics and AI algorithms to enter routine clinical practice and to avoid a new glacial era of image mining in the coming decades.