Systematic review of research design and reporting of imaging studies applying convolutional neural networks for radiological cancer diagnosis

Objectives To perform a systematic review of design and reporting of imaging studies applying convolutional neural network models for radiological cancer diagnosis.

Methods A comprehensive search of PubMed, EMBASE, MEDLINE and Scopus was performed for published studies applying convolutional neural network models to radiological cancer diagnosis from January 1, 2016, to August 1, 2020. Two independent reviewers measured compliance with the Checklist for Artificial Intelligence in Medical Imaging (CLAIM). Compliance was defined as the proportion of applicable CLAIM items satisfied.

Results One hundred eighty-six of 655 screened studies were included. Many studies did not meet the criteria of current design and reporting guidelines. Twenty-seven percent of studies documented eligibility criteria for their data (50/186, 95% CI 21–34%), 31% reported demographics for their study population (58/186, 95% CI 25–39%) and 49% of studies assessed model performance on test data partitions (91/186, 95% CI 42–57%). Median CLAIM compliance was 0.40 (IQR 0.33–0.49). Compliance correlated positively with publication year (ρ = 0.15, p = .04) and journal H-index (ρ = 0.27, p < .001). Clinical journals demonstrated higher mean compliance than technical journals (0.44 vs. 0.37, p < .001).

Conclusions Our findings highlight opportunities for improved design and reporting of convolutional neural network research for radiological cancer diagnosis.

Key Points
• Imaging studies applying convolutional neural networks (CNNs) for cancer diagnosis frequently omit key clinical information including eligibility criteria and population demographics.
• Fewer than half of imaging studies assessed model performance on explicitly unobserved test data partitions.
• Design and reporting standards have improved in CNN research for radiological cancer diagnosis, though many opportunities remain for further progress.
Supplementary Information The online version contains supplementary material available at 10.1007/s00330-021-07881-2.


Introduction
Recent years have seen an increase in the volume of artificial intelligence (AI) research in the field of cancer imaging, prompting calls for appropriately rigorous design and appraisal standards [1][2][3][4][5][6]. Evaluation of AI research requires a skillset which is distinct from those of classical medical statistics and epidemiology. The problems of high dimensionality, overfitting and model generalisation are central challenges in AI modelling [7][8][9][10]. These phenomena potentially compromise the generalisation of AI models to the reality of clinical practice [11]. However, the reliability of these models may be estimated and maximised through rigorous experimental design and reporting [1,12].
EQUATOR was founded to improve the quality of scientific research through standardisation of reporting guidelines [13,14]. Established EQUATOR guidelines such as STARD [15], STROBE [16] and CONSORT [17] were not designed specifically to address the challenges of AI research. AI-focused guidelines have recently been developed including CLAIM [18], SPIRIT-AI [19], MI-CLAIM [20] and, prospectively, STARD-AI [21]. These are welcome measures as AI remains at an early phase of clinical implementation for diagnostic tasks. Although each set of reporting standards addresses a specific task, a high degree of overlap exists between these guidelines, reflecting the fundamental importance of many of the criteria.
CLAIM aims to promote clear, transparent and reproducible scientific communication about the application of AI to medical imaging and provides a framework to assure high-quality scientific reporting. Conformity to these standards has not yet been formally quantified. Consequently, a need exists for a contemporary evaluation of design and reporting standards in the domain of cancer imaging AI research.
Following ImageNet 2012 [22], convolutional neural network (CNN) models have been adapted to various biomedical tasks. The approach is now the industry standard in AI applications for diagnostic radiology [23,24]. In this study, we aim to quantify explicit satisfaction of the CLAIM criteria in recent studies applying CNNs to cancer imaging. We examine the adequacy of data and ground truth collection, model evaluation, result reporting, model interpretation, benchmarking and transparency in the field. We identify key areas for improvement in the design and reporting of CNN research in the field of diagnostic cancer imaging.

Materials and methods
Inclusion and exclusion criteria

Exclusion criteria:
1. The model addresses a non-diagnostic task such as preprocessing, segmentation or genotyping.
2. The model receives non-radiological images such as histopathology, dermoscopy, endoscopy or retinoscopy.
3. The article presents experiments on animal or synthetic data.
4. The article primarily addresses economic aspects of model implementation.
5. The article is published in a low-impact journal.
6. The article is unavailable in full-text format.

Search
PubMed, EMBASE, MEDLINE and Scopus databases were searched systematically for original articles published from January 1, 2016, to August 14, 2020, meeting our inclusion and exclusion criteria. Search queries for each database are included in the supplementary material. The search was performed on August 14, 2020. No other sources were used to identify articles. Screening and decisions regarding inclusion based on the full text were performed independently by two reviewers (R.O.S., A.S., clinical fellows with 3 years and 1 year of experience of AI research, respectively) and disagreements were resolved by consensus. A senior reviewer (V.G.) was available to provide a final decision on unresolved disagreements.

Data extraction
Data items were defined to measure compliance with the CLAIM proposal and previously published proposals [1,18]. Complex items with multiple conditions were subdivided as appropriate. Data items are listed in Table 1. First author, journal, publication year, modality and body system were also extracted. Studies which served only to validate existing models were exempt from all items pertaining to model development. Studies not employing model ensembling were exempt from item 27. Articles were read and annotated by R.O.S. and A.S., and disagreements were resolved by consensus. Articles were read in random order, using a fixed sequence generated in R [26]. Journal H-index was extracted from the Scimago journal rankings database [27]. Journals were categorised as either "clinical" or "technical" according to the journal name: names containing any term related to computer science, artificial intelligence or machine learning were assigned the "technical" category; the remaining journals were assigned the "clinical" category.

Data analysis
Statistical analysis was conducted using R version 3.5.3 [26] and RStudio version 1.1.463 [28]. For each item, the proportion of compliant studies was measured, excluding those with applicable exemptions. For items with ≥ 3 response categories, proportions were also measured for each category. Ninety-five percent confidence intervals (95% CI) were estimated around binary proportions using the method of Clopper and Pearson [29] and around multinomial proportions using the method of Sison and Glaz [30,31]. Following adherence assessment recommendations [32], an overall CLAIM compliance score was defined per article by the proportion of applicable items satisfied. Items and subitems were weighted equally.
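The authors' analysis was performed in R; as an illustration only, the exact (Clopper-Pearson) binomial interval can be sketched in stdlib-only Python by bisecting the binomial tail probability, which is monotone in p. This is not the authors' code, merely a minimal sketch of the method cited above.

```python
from math import comb

def binom_ge(k, n, p):
    """P(X >= k) for X ~ Binomial(n, p)."""
    if p <= 0.0:
        return 1.0 if k <= 0 else 0.0
    if p >= 1.0:
        return 1.0
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

def clopper_pearson(k, n, alpha=0.05):
    """Exact two-sided Clopper-Pearson CI for a binomial proportion,
    obtained by bisection on the (monotone-in-p) binomial tail."""
    def solve(f, target):
        lo, hi = 0.0, 1.0
        for _ in range(60):  # bisection: interval width halves each step
            mid = (lo + hi) / 2
            if f(mid) < target:
                lo = mid
            else:
                hi = mid
        return (lo + hi) / 2
    lower = 0.0 if k == 0 else solve(lambda p: binom_ge(k, n, p), alpha / 2)
    upper = 1.0 if k == n else solve(lambda p: binom_ge(k + 1, n, p), 1 - alpha / 2)
    return lower, upper

# Example from the review: 50 of 186 studies documented eligibility criteria
lo, hi = clopper_pearson(50, 186)  # approximately (0.21, 0.34)
```

The same interval is available as `statsmodels.stats.proportion.proportion_confint(..., method="beta")` or via `binom.test` in R.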
CLAIM compliance = number of items satisfied / number of items applicable

Temporal change in CLAIM compliance was evaluated by a two-sided test of Spearman rank correlation between CLAIM score and year of publication. Association between journal impact and compliance was evaluated with a two-sided test of Spearman rank correlation between journal H-index and CLAIM score. The difference in mean CLAIM compliance between clinical and technical journals was evaluated with a two-sided t test. All code and data required to support the findings of this research are available from the corresponding author upon request. As a methodological review assessing study reporting, this study was not eligible for registration with the PROSPERO database.
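The compliance score and the rank correlation against publication year can be illustrated with a short, self-contained sketch. The per-article counts below are hypothetical and the Spearman implementation (Pearson correlation of average ranks) is written out for transparency; it is not the authors' R code.

```python
def claim_compliance(satisfied, applicable):
    """Per-article CLAIM compliance: applicable items satisfied / applicable items."""
    return satisfied / applicable

def spearman_rho(x, y):
    """Spearman rank correlation, using average ranks for ties."""
    def ranks(v):
        order = sorted(range(len(v)), key=lambda i: v[i])
        r = [0.0] * len(v)
        i = 0
        while i < len(v):
            j = i
            while j + 1 < len(v) and v[order[j + 1]] == v[order[i]]:
                j += 1
            avg = (i + j) / 2 + 1  # 1-based average rank over the tie run
            for t in range(i, j + 1):
                r[order[t]] = avg
            i = j + 1
        return r
    rx, ry = ranks(x), ranks(y)
    n = len(x)
    mx, my = sum(rx) / n, sum(ry) / n
    num = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    den = (sum((a - mx) ** 2 for a in rx) * sum((b - my) ** 2 for b in ry)) ** 0.5
    return num / den

# Hypothetical per-article (satisfied, applicable) counts by publication year
years = [2016, 2017, 2017, 2018, 2019, 2020]
items = [(12, 40), (15, 40), (13, 38), (18, 41), (20, 42), (22, 40)]
scores = [claim_compliance(s, a) for s, a in items]
rho = spearman_rho(years, scores)  # positive in this toy data
```

In practice this would be `scipy.stats.spearmanr` in Python or `cor.test(..., method = "spearman")` in R.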

Results

Search
Six hundred fifty-five articles were identified in the primary database search, of which 267 were duplicates. One hundred twenty articles were excluded during title screening, and 82 articles were excluded during abstract screening. One hundred eighty-six articles were included in the final analysis. A flow diagram for the literature search process is provided in Fig. 1. The dataset included articles from 106 journals. Fifty-four clinical journals and 44 technical journals were included. Assigned journal categories are provided in Supplementary Table 1. The distributions of article publication year, body system and modality are provided in Fig. 2.

Data partitions
Eighty-seven percent of studies reported the number of images modelled (161/186, 95% CI 81–91%), though only 1% provided a power calculation (1/186, 95% CI 0–3%). Seventy-two percent specified the number of study participants in their dataset (133/186, 95% CI 64–78%). Of these, a median of 367 participants were included (IQR 172–1000). Seven studies served only to validate existing models and were exempted from criteria pertaining to model development and data partitioning. Seventy-six percent of modelling studies defined data partitions and their proportions (136/179, 95% CI 69–82%), though only 32% specified the level of partition disjunction (58/179, 95% CI 26–40%).

Fig. 3 Compliance with CLAIM items 1–13. Compliance rate is defined as the proportion of articles subject to that item which satisfy it. Exemptions are provided in Table 1. Point estimates and 95% confidence intervals are reported.

Model
Sixty-six percent of modelling studies provided a detailed model description (119/179, 95% CI 59–73%) and 20% of modelling studies provided access to source code.

Discussion
Radiological AI is undergoing a development phase, reflected in growing annual publication volume and recognition by clinical researchers [33–37]. To safely harness the potential of new methodologies, clinicians have called for realistic, reproducible and ethical research practices [1, 38–44]. The CLAIM guidance sets stringent standards for research in this domain, amalgamating the technical requirements of the statistical learning field [9,45] with the practicalities of clinical research [1,2,15,46]. We observed that documentation standards improved over time, a finding concordant with previous reviews of AI research [43,45]. Compliance was highest in impactful clinical journals, demonstrating the value of design and reporting practices at peer review.
A key opportunity for improvement is model testing, addressed by items 20, 21, 32 and 35. Documentation should specify three disjoint data partitions for CNN modelling (which may be resampled with cross-validation or bootstrapping). Training data is used for model learning, validation data for model selection and test data to assess performance of a finalised model [47,48]. Half of studies documented two or fewer partitions; in these cases, results may have represented validation or even training performance. Where data partitions were not disjoint on a per-patient basis, data leakage may have occurred despite partitioned model testing. These scenarios bias generalisability metrics optimistically. Some multi-centre studies partitioned data at the patient level rather than the institutional level, missing an opportunity to evaluate inter-institution generalisability.

Fig. CLAIM compliance in clinical journals and technical journals (journal categorisation as described in Materials and methods).
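Patient-level partition disjunction can be enforced mechanically by splitting on patient identifiers rather than on individual images. The following stdlib-only sketch assumes a hypothetical image-to-patient mapping and illustrates the principle only; it is not code from any of the reviewed studies.

```python
import random

def patient_level_split(image_to_patient, fractions=(0.7, 0.15, 0.15), seed=0):
    """Partition images into train/validation/test sets that are disjoint at
    the patient level: no patient contributes images to more than one set.
    `image_to_patient` is a hypothetical mapping of image id -> patient id."""
    patients = sorted(set(image_to_patient.values()))
    random.Random(seed).shuffle(patients)
    n_train = int(fractions[0] * len(patients))
    n_val = int(fractions[1] * len(patients))
    part_of = {}
    for i, pid in enumerate(patients):
        part_of[pid] = ("train" if i < n_train
                        else "val" if i < n_train + n_val
                        else "test")
    splits = {"train": [], "val": [], "test": []}
    for img, pid in image_to_patient.items():
        splits[part_of[pid]].append(img)
    return splits
```

Multi-centre studies could apply the same idea one level up, grouping by institution rather than by patient, to evaluate inter-institution generalisability; scikit-learn's `GroupShuffleSplit` implements the equivalent group-disjoint splitting.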
Evidently, CLAIM has also introduced requirements which depart from current norms. Few studies satisfied item 12, which requires the documentation of data anonymisation methods, an issue which has developed with image recognition capabilities [41,49,50]. This requirement may have previously been relaxed for studies of publicly available data or those which documented institutional review board approval, as either case suggests previous certification of data governance procedures. The spirit of the CLAIM guidance is obviation of such assumptions with clear documentation, promoting a culture of research transparency. In many such cases, the burden of improved compliance is minimal, mandating only the documentation of additional information.
Our findings concur with previous reviews of design and reporting standards in both clinical and general-purpose AI research. A review of studies benchmarking AI against radiologists identified deficient documentation of data availability, source code, eligibility and study setting [38]. Reviews of TRIPOD adherence in multivariate diagnostic modelling found deficient model assessment and data description [12,51,52]. Reviews of reproducibility in AI research have reported insufficient documentation of data availability, source code, protocols and study registration [43,45,53]. Many commentators have advocated for transparency in clinical AI research [19,38,40,42,43,53,54].
We note several limitations to this systematic review. First, as scope was limited to studies published in English, findings were susceptible to language bias. Second, although reporting standards were directly measurable, items relating to study design were only measurable if reported. Consequently, design compliance may have been underestimated in poorly reported studies. This is a general limitation of reviews in this field. Third, articles were read sequentially and therefore readers were potentially susceptible to anchoring bias. The effect of anchoring on the trend and subgroup analyses was minimised by randomisation of the reading order.

Conclusions
Design and reporting standards have improved in CNN research for radiological cancer diagnosis, though many opportunities remain for further progress. The CLAIM guidance sets a high standard for this developing field, consolidating clinical and technical research requirements to enhance the quality of evidence. Our data supports the need for integration of CLAIM guidance into the design and reporting of CNN studies for radiological cancer diagnosis.

Declarations
Guarantor The scientific guarantor of this publication is Vicky Goh (Vicky.goh@kcl.ac.uk).

Conflict of interest
The authors of this manuscript declare no relationships with any companies whose products or services may be related to the subject matter of the article.
Statistics and biometry No complex statistical methods were necessary for this paper.
Informed consent Written informed consent was not required for this study because this was a systematic review using published studies in the literature but not analysing specific human subjects.
Ethical approval Institutional review board approval was not required because this was a systematic review using published studies in the literature but not analysing specific human subjects.

Methodology
• Retrospective
• Multicentre study
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.