Systematic review of machine learning-based radiomics approach for predicting microsatellite instability status in colorectal cancer

This study aimed to systematically summarize the performance of machine learning-based radiomics models in predicting microsatellite instability (MSI) in patients with colorectal cancer (CRC). It was conducted according to the Preferred Reporting Items for a Systematic Review and Meta-analysis of Diagnostic Test Accuracy Studies (PRISMA-DTA) guideline and was registered at PROSPERO under the identifier CRD42022295787. A systematic literature search was conducted in PubMed, Embase, Web of Science, and the Cochrane Library up to November 10, 2022. Studies that applied radiomics analysis to preoperative CT/MRI/PET-CT images to predict MSI status in CRC patients with no history of anti-tumor therapy were eligible. The radiomics quality score (RQS; full score 100%) and the Quality Assessment of Diagnostic Accuracy Studies 2 (QUADAS-2) tool were applied to evaluate research quality. Twelve studies with 4,320 patients were included. All studies were retrospective, and only four had an external validation cohort. The median incidence of MSI was 19% (range 8–34%). The area under the receiver operating characteristic curve (AUC) of the models ranged from 0.78 to 0.96 (median 0.83) in the external validation cohort. The median sensitivity was 0.76 (range 0.32–1.00), and the median specificity was 0.87 (range 0.69–1.00). The median RQS was 38% (range 14–50%), and half of the studies showed high risk in patient selection as evaluated by QUADAS-2. In conclusion, although radiomics based on pretreatment imaging showed high performance in predicting MSI status in CRC, it does not yet appear ready for clinical use because of insufficient methodological quality.
Supplementary Information: The online version contains supplementary material available at 10.1007/s11547-023-01593-x.


Introduction
Colorectal cancer (CRC) ranks as the third most common malignant tumor and the second leading cause of cancer-related death globally [1]. Microsatellite instability (MSI) is a well-established cancer hallmark, defined as the generalized instability of short, non-sense, repeated DNA sequences (i.e., microsatellites) caused by a deficient system for repairing DNA mismatches at replication. About 13–15% of CRC patients have tumors with MSI [2,3]. MSI tumors occur more often in older patients and in right-sided locations and tend to have a lower pathological stage, representing a distinct CRC subtype [4].
Clinical decision-making can benefit from information on pretreatment MSI status in patients with CRC. Patients with MSI often have better outcomes and are less likely to have lymph node spread and metastasis [2,5]. In addition, patients with MSI CRC generally do not benefit from preoperative 5-fluorouracil-based adjuvant therapy [6–8]. In this context, MSI testing has been recommended for all patients with stage II rectal cancer by the National Comprehensive Cancer Network practice guidelines since 2016 [9]. Furthermore, MSI status can serve as a predictor of the response to immunotherapy [10,11]. Previous studies have shown that MSI CRC patients are sensitive to immune checkpoint inhibitors because of the high expression level of mutant neoantigens [12,13]. Therefore, the European Society for Medical Oncology recommends MSI evaluation before immunotherapy [14], and the US Food and Drug Administration has approved MSI as an indication for cancer immunotherapy [15].
At present, MSI status is mainly evaluated by immunohistochemistry or polymerase chain reaction on specimens obtained by colonoscopic biopsy or surgical resection [2]. However, information on mismatch repair protein expression obtained postoperatively has little influence on pretreatment planning, and the limited samples obtained by biopsy may not fully reflect intra-tumoral heterogeneity [16]. In some cases (2.1–5.9%), a false-negative result may occur [17]. In addition, biopsy and surgery are invasive procedures that expose patients to the risk of procedure-related complications and are not practical for repeated monitoring [18]. A non-invasive, reliable, and cost-effective approach to identifying MSI status would therefore be of great value.
Imaging modalities, such as computed tomography (CT), magnetic resonance imaging (MRI), and positron emission tomography/CT (PET/CT), are commonly used for the detection, characterization, and staging of CRC. The subtle information underlying these images may reflect genetic/molecular alterations of CRC, such as MSI [19]. With modern computing techniques, imaging information can be mined and converted into quantitative high-dimensional data, which can then be used to construct prediction models via machine learning algorithms; this technique has been termed "radiomics" [19–22]. In recent years, many studies using the radiomics approach to predict MSI status in CRC have emerged [22]. However, the reported prediction accuracy and efficacy of these radiomics models vary, and their overall performance remains unknown. To date, no study has summarized the current evidence on radiomics methods for MSI status prediction in CRC patients. Such summaries are of clinical importance for evidence-based patient management. This systematic review therefore aimed to summarize the current evidence on the predictive performance of radiomics models in the diagnosis of MSI in CRC. In addition, the research and reporting quality of these studies was evaluated.

Materials and methods
This study was conducted according to the Preferred Reporting Items for a Systematic Review and Meta-analysis of Diagnostic Test Accuracy Studies (PRISMA-DTA) guideline [23], and the checklist can be found in Supplementary file 1. The research protocol has been registered at the PROSPERO website (https://www.crd.york.ac.uk/prospero/) under registration No. CRD42022295787.

Literature search
A systematic literature search was performed to detect any potentially relevant publications in four public databases: PubMed, Embase, Web of Science, and Cochrane Library, with the key terms "colorectal cancer (CRC)/colon cancer/rectal cancer/colorectal liver metastases (CRLM)", "microsatellite instability (MSI)/mismatch repair deficient (dMMR)" and "radiomics/texture analysis/radiogenomics/imaging biomarker", their synonyms, and their Medical Subject Headings terms (detailed search queries are provided in Supplementary file 2). The literature search was first conducted on April 15, 2022 and last updated on November 10, 2022.

Study selection
Studies meeting the following inclusion and exclusion criteria were regarded as eligible and included in this research. Inclusion criteria: 1) retrospective or prospective design; 2) patients with CRC confirmed by postoperative histopathological examination and no history of anti-tumor therapies (i.e., neoadjuvant chemotherapy or radiation therapy) before imaging examinations; 3) radiomics features extracted from the entire volume of the lesion at CT, MRI or PET/CT examinations and used as a single predictor or one of the variables in a prediction model; 4) MSI status was evaluated on the surgical specimens; 5) publications in English. Exclusion criteria: 1) publications in the form of review, conference abstract, corrigendum, book chapter, or study protocol; 2) research outcomes not involving MSI; 3) deep learning research; 4) sample size of less than 50 patients.
Two researchers ('Q.W' and 'J.X', with 7 and 2 years of experience in preparing and updating systematic reviews, respectively) conducted study selection independently, first by screening titles and abstracts and then by reading the full text of potentially eligible studies. Disagreements were resolved by discussion or by consultation with a senior researcher ('T.B.B'). In addition, reviews and the references cited in the included articles were manually screened to detect any eligible research.

Data extraction
A predefined table was used to extract the study information, which included: 1) basic study characteristics (for example, the first author, publication year, country, and study design); 2) patient characteristics; 3) characteristics of the radiomics workflow (such as the tumor segmentation method and the software used for radiomics feature extraction; a typical radiomics research workflow is shown in Fig. 1); 4) diagnostic performance metrics (true positives, false positives, false negatives, and true negatives) to construct a 2 × 2 table. When a study involved training and test cohorts, the diagnostic performance in the test cohort was taken as the model's predictive power. If several prediction models were developed in one study, the model with the best performance was chosen. If the study did not have a test cohort, the predictive metrics in the validation cohort were extracted. When the provided data on diagnostic performance were insufficient to create a 2 × 2 table, an email was sent to the corresponding author to request the missing information. The metrics were visualized as a forest plot to evaluate the predictive performance of the radiomics prediction models, using the software Review Manager (RevMan, version 5.3. Copenhagen: The Nordic Cochrane Centre, The Cochrane Collaboration, 2014).
The terms "validation" and "test" were unified in this study to avoid any confusion: "validation cohort" was defined as the part of the training cohort that was randomly split off for fine-tuning of hyperparameters during modeling, while "test cohort" was defined as a hold-out dataset that was external to the training cohort and not involved in the modeling [24]. The test cohort could be temporally or geographically independent from the training cohort [25]. "External cohort" and "test cohort" are used interchangeably in this study.
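As an illustration of how sensitivity and specificity follow from the four extracted counts, the snippet below gives a minimal Python sketch. The class name `ConfusionTable` and the example counts are hypothetical and are not taken from any included study; MSI is assumed to be the positive class.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ConfusionTable:
    """Counts of a 2 x 2 diagnostic table (MSI assumed to be the positive class)."""
    tp: int  # MSI tumors predicted as MSI
    fp: int  # non-MSI tumors predicted as MSI
    fn: int  # MSI tumors predicted as non-MSI
    tn: int  # non-MSI tumors predicted as non-MSI

    def sensitivity(self) -> float:
        # Proportion of MSI tumors that the model identifies correctly.
        return self.tp / (self.tp + self.fn)

    def specificity(self) -> float:
        # Proportion of non-MSI tumors that the model identifies correctly.
        return self.tn / (self.tn + self.fp)


# Hypothetical counts for illustration only.
example = ConfusionTable(tp=16, fp=8, fn=4, tn=72)
print(f"sensitivity = {example.sensitivity():.2f}")
print(f"specificity = {example.specificity():.2f}")
```

The reverse direction (recovering the counts from a reported sensitivity, specificity, and cohort size) is only approximate because of rounding, which may be one reason why contacting the authors was needed when the counts were not reported.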

Assessment of radiomics quality score and the risk of bias
The tool used for methodological quality evaluation of the radiomics studies was the radiomics quality score (RQS) scale, which was proposed by Lambin and colleagues in 2017 [20]. The RQS scale consists of 16 items evaluating the research and reporting quality in the workflow of the radiomics model development. Different points are assigned to each item according to the degree the research achieves. The total points for this scoring system are 36, corresponding to 100% in percentage [20].
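The conversion from RQS points to a percentage can be sketched in a few lines of Python. The function name `rqs_percent` is hypothetical; the full score of 36 points and the example point values (5, 13.5, and 18) are those reported in this review.

```python
RQS_MAX_POINTS = 36  # full score of the RQS scale, as stated in the text


def rqs_percent(points: float, max_points: int = RQS_MAX_POINTS) -> float:
    """Express raw RQS points as a percentage of the full score."""
    if points > max_points:
        raise ValueError("points cannot exceed the full score")
    return 100.0 * points / max_points


# Point values reported in this review: lowest, median, and highest score.
for points in (5, 13.5, 18):
    print(f"{points} points -> {rqs_percent(points):.1f}% of the full score")
```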
Research quality was also evaluated by using the Quality Assessment of Diagnostic Accuracy Studies 2 (QUADAS-2) criterion [26]. This tool assesses the risk of bias in a study in four dimensions: patient selection, index test, reference standard, and flow and timing, with results marked as low, high, and unclear risk indicating different levels of risk in each domain [26].
Data extraction and study quality evaluation were performed and cross-validated by the same two researchers ('Q.W' and 'J.X'). In case of a discrepancy, the senior researcher ('T.B.B') was consulted to reach an agreement.

General characteristics and the incidence of MSI
The included studies were published between December 2019 and August 2022, and all studies were retrospectively designed (one study claimed to be prospective, but was judged as retrospective after discussion [32]). A total of 4,320 patients were included, with a sample size ranging from 90 to 837 (median 238) and a male/female ratio of 1.5 (2,592/1,728). Four studies were performed as multicenter research, with a sample size in the external cohorts ranging from 61 to 441 (median 82) [30,35,37,38]. Five studies exclusively focused on rectal cancer, while the others focused on CRC [29,32,36–38].
Based on the surgically resected specimens, eleven studies evaluated the MSI status using the immunohistochemistry approach and one using the polymerase chain reaction method [31]. The incidence of MSI ranged from 8 to 34% (median: 19%). Among nine studies with available data, a majority of studies (8/9) reported an interval between imaging examination and surgery of less than 2 weeks [27,29,30,33,34,36–38]. Table 1 provides detailed information about the basic characteristics of the included studies.

RQS and QUADAS-2 assessment
The median RQS score of the included studies was 13.5 points (range 5-18), corresponding to 38% (range 14-50%) of the full RQS score. The highest score of 50% was obtained in only one study [30]. The lowest score of 5 points (14%) was observed in an early study on this topic, and the main points were lost due to a lack of validation cohort [27]. Regarding performance in each item of the RQS, three items were fulfilled by all studies (100%): "feature reduction or adjustment," "biological correlates," and "comparison to gold standard." On the other hand, four items ("phantom study," "prospective study," "cost-effectiveness analysis" and "open science and data") were assigned 0 as none of the included studies involved them. A summary of the RQS score is presented in Fig. 3 A and B, and detailed information on the RQS score for each study is provided in Supplementary file 3.
A majority of the studies showed a low or unclear risk of bias and applicability concerns as evaluated by QUADAS-2 (Fig. 3C). The main source of high risk of bias and applicability concerns was the "patient selection" domain: owing to the retrospective nature of the studies, patient selection bias seemed inevitable. Detailed evaluation of the included studies in each domain is provided in Supplementary file 4.

Discussion
This systematic review showed that radiomics models using the machine learning approach on pretreatment imaging modalities had a high predictive efficacy, with a median AUC of 0.83, a median sensitivity of 0.76, and a median specificity of 0.87. Despite these promising results, radiomics models are still far from clinical utility because of insufficient methodological quality, as reflected by the low RQS. The translation of these prediction models into routine clinical settings is mainly determined by study validity. Ideally, a reliable radiomics signature is developed from a prospective, large cohort with a consecutively enrolled study population. Although none of the included studies was prospectively designed, the largest sample size was as high as 837, and almost half of the studies (5/12) had a sample size of over 490. The median incidence of MSI in the included studies was 19%, which was slightly higher than the reported incidence (13–15%) [3,39–42]. Two studies that did not state whether the subjects were consecutively included had an MSI incidence as high as 33% and 34% [32,38], which might be due to their case-control study design (1:2). In diagnostic test studies, this type of design is prone to overestimating the performance of the prediction model and should be avoided because it does not reflect the real-world situation [26]. One may argue that, when machine learning algorithms are applied, the positive and negative classes of a cohort should be balanced to avoid potential overfitting. In fact, several techniques have been proposed to deal with this situation, such as the Synthetic Minority Oversampling Technique [43,44]. Half of the reviewed studies adopted techniques to cope with the imbalanced classifications [28,32,35,36,38].
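To make the idea behind the Synthetic Minority Oversampling Technique concrete, the following is a minimal, standard-library Python sketch of its core step: a synthetic minority sample is generated by interpolating between a minority sample and one of its k nearest minority-class neighbours. The function name `smote_like_oversample` and the feature values are hypothetical. This is a simplified illustration, not the implementation used by any of the included studies.

```python
import math
import random


def smote_like_oversample(minority, n_synthetic, k=5, seed=0):
    """Create synthetic minority samples by interpolating between a sample
    and one of its k nearest neighbours within the minority class."""
    if len(minority) < 2:
        raise ValueError("need at least two minority samples")
    rng = random.Random(seed)
    k = min(k, len(minority) - 1)
    synthetic = []
    for _ in range(n_synthetic):
        i = rng.randrange(len(minority))
        base = minority[i]
        # k nearest neighbours of `base` among the other minority samples.
        others = [s for j, s in enumerate(minority) if j != i]
        neighbours = sorted(others, key=lambda s: math.dist(base, s))[:k]
        partner = rng.choice(neighbours)
        gap = rng.random()  # position on the segment from base to partner
        synthetic.append([b + gap * (p - b) for b, p in zip(base, partner)])
    return synthetic


# Hypothetical two-feature radiomics vectors of MSI (minority class) cases.
minority_features = [[0.2, 1.1], [0.3, 0.9], [0.25, 1.3], [0.4, 1.0]]
new_samples = smote_like_oversample(minority_features, n_synthetic=6, k=2)
print(len(new_samples), "synthetic minority samples created")
```

In practice, oversampling of this kind is normally applied to the training cohort only, so that the validation and test cohorts keep the real-world class ratio.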
Before radiomics models are translated into clinical practice, it is also vital to verify the model in an external cohort [25]. Because a model developed in the training cohort tends to overfit, the external cohort can be used to evaluate the generalizability of a prediction model and to estimate its performance in real-world practice [45]. One-third of the studies (4/12) tested their models in an external cohort, yielding a median AUC of 0.83 [32,34,38]. On the other hand, internal validation using cross-validation or bootstrapping techniques within the training cohort plays an equivalent role in avoiding potential overfitting and optimizing the prediction model [45–47]. Six studies adopted five- or tenfold cross-validation when developing their models.
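The k-fold cross-validation mentioned above can be sketched as follows. This is a minimal, standard-library Python illustration of how a training cohort is partitioned; the function name `k_fold_indices` and the cohort size of 100 are hypothetical, and model fitting itself is left out.

```python
import random


def k_fold_indices(n_samples, k=5, seed=0):
    """Yield (train_idx, val_idx) pairs for k-fold cross-validation.

    Every sample is used for validation exactly once and for training
    in the remaining k - 1 folds.
    """
    if not 2 <= k <= n_samples:
        raise ValueError("k must be between 2 and n_samples")
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]  # k near-equal folds
    for i, val_idx in enumerate(folds):
        train_idx = [idx for j, fold in enumerate(folds) if j != i
                     for idx in fold]
        yield train_idx, val_idx


# Hypothetical training cohort of 100 patients, fivefold cross-validation.
for fold, (train_idx, val_idx) in enumerate(k_fold_indices(100, k=5), start=1):
    print(f"fold {fold}: {len(train_idx)} training, {len(val_idx)} validation")
```

Note that all folds come from the training cohort, so this procedure corresponds to internal validation; an external test cohort, as defined in this review, is never part of the split.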
Researchers should also make their prediction models reproducible and open to validation by other investigators. A first step could be to deposit the radiomics code/data on a public platform (such as https://github.com) or to provide more details on software usage. However, none of the included studies published their code or data, resulting in a zero score for the "open science and data" item of the RQS scale. In addition, the models should be presented in a proper and easy-to-use form for clinical usage, for example as a nomogram. Six studies provided the formula and/or nomogram, which is one step toward validation of their models by other centers. Furthermore, the determination of the optimal cutoff value of the prediction model is often a trade-off between sensitivity and specificity. Its important role has been emphasized by a specific item in the RQS. Knowledge of the specific cutoff value of a model makes it possible for other researchers to validate the model. However, only two studies stated the cutoff values of their models [32,35]. When implementing a prediction model, researchers should also be aware of the target patient population or subpopulation. The patients in the included studies had different indications: some studies focused solely on rectal cancer or stage II/III CRC, while others addressed the general CRC population.
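The trade-off between sensitivity and specificity when choosing a cutoff can be illustrated with a short Python sketch. It uses the Youden index (J = sensitivity + specificity - 1) as the selection criterion, which is one common choice and an assumption here; the included studies may have used other criteria. The function name `youden_cutoff`, the scores, and the labels are hypothetical.

```python
def youden_cutoff(scores, labels):
    """Return (cutoff, sensitivity, specificity) maximizing the Youden index.

    A case is predicted positive (MSI) when its score is >= cutoff.
    `labels` holds 1 for MSI and 0 for non-MSI.
    """
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("both classes must be present")
    best = None
    for cutoff in sorted(set(scores)):
        tp = sum(1 for s, y in zip(scores, labels) if y == 1 and s >= cutoff)
        tn = sum(1 for s, y in zip(scores, labels) if y == 0 and s < cutoff)
        sens, spec = tp / n_pos, tn / n_neg
        j = sens + spec - 1
        # Keep the lowest cutoff in case of ties.
        if best is None or j > best[0]:
            best = (j, cutoff, sens, spec)
    return best[1:]


# Hypothetical model outputs (radiomics scores) and true MSI labels.
scores = [0.1, 0.2, 0.3, 0.4, 0.6, 0.7, 0.8, 0.9]
labels = [0, 0, 0, 1, 0, 1, 1, 1]
cutoff, sens, spec = youden_cutoff(scores, labels)
print(f"cutoff = {cutoff}, sensitivity = {sens}, specificity = {spec}")
```

Raising the cutoff in this example increases specificity at the cost of sensitivity, which is why a reported cutoff value is needed for other centers to reproduce a model's stated performance.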
RQS is a commonly used tool for the appraisal of radiomics research quality [20]. As it evaluates the key steps in the radiomics research workflow, RQS has the potential to become not only a guide for conducting a radiomics study but also a useful checklist for authors submitting a manuscript to a journal. The included studies performed well in three domains of the RQS scale, which account for 17% of the full score (6 points). In addition, more than half of the studies (7/12) reported both a discriminative performance and a resampling technique for the item "discrimination statistics", earning an average of 1.6 points for this item. However, the included studies yielded a median score of only 13.5 points (38% of the full score of 36) and a highest score of 18 points (50% of the full score). The main reason was that four domains of the RQS scale, such as making code/data public, were not addressed by any of the included studies. These four domains account for 39% of the full scale (14 points). Moreover, the RQS scale may assign too high a weight to the item "prospective study" (7 points), approximately 20% of the full score. This is a relatively high weight given that most other items in the RQS tool have a maximum of 1–2 points. Because no prospective studies were included in this systematic review, this further contributed to the lower RQS in the included studies.
On the other hand, other appraisal tools, such as QUADAS-2, which was designed for the appraisal of general diagnostic test studies, should also be adopted to complement the RQS tool in the assessment of radiomics research quality. For instance, the RQS scale does not address patient selection, but this issue is of clinical importance when evaluating a diagnostic test study. In QUADAS-2, patient selection is one of the four main dimensions. In addition, other commonly used guidelines, such as the "checklist for artificial intelligence in medical imaging" (CLAIM) [48] and the "transparent reporting of a multivariable prediction model for individual prognosis or diagnosis" (TRIPOD) statement [25], may also help in conducting a rigorous and reproducible radiomics study and in improving research and reporting quality.
There are some limitations in this study. First, the number of included studies was relatively limited, no study was prospectively designed, and only four studies validated their models in external cohorts. These limitations may undermine the conclusion drawn from our study. On the other hand, the limited number of studies, as shown by the initial records retrieved from the four databases, also reflects that this topic (using radiomics approach for predicting gene expression levels in CRC) is relatively novel and the research is still at its early stage. Second, the included studies were heterogeneous not only in the imaging modalities and phase/sequence used but also in the imaging features and modeling strategies. In this context, a meta-analysis to synthesize the diagnostic metrics was not performed and a pooled AUC for the radiomics model in the prediction of MSI status was therefore absent. Third, deep learning studies were not included due to the poor interpretability of deep learning-derived imaging features. This is also a burgeoning field where the deep learning model is often assumed to have higher accuracy than the radiomics models [49]. Lastly, although RQS is a useful tool in the assessment of radiomics research quality, it has limitations. Further revision of RQS might make it more comprehensive in the quality appraisal of the radiomics studies.

Conclusions
In conclusion, despite radiomics models derived from pretreatment imaging modalities having a high performance in the prediction of MSI status in CRC patients, radiomics does not seem to be ready to serve as an imaging biomarker utilized in clinical practice due to the insufficient methodological quality of the research.
Availability of data and materials
The datasets supporting the conclusions of this article are included within the article and its supplementary files.

Conflict of Interest
The authors declare no conflict of interest.

Informed Consent
Not applicable.

Open Access
This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.